Abstract

Optimizing the spread of influence in online social networks (OSNs) is important for the design
of efficient viral marketing strategies using online recommendations [1]. It is commonly believed
that spreading is a global process, whose optimization would require knowledge of the whole
network information [1-3]. Here we uncover a characteristic local length scale, called the influence
radius, hidden in the global nature of spreading processes. We show that any node’s influence on
the entire OSN can be quantified from its local network environment within the influence radius,
which is significantly smaller than the whole network diameter. By mapping the problem onto
bond percolation [4], we give a theoretical explanation for the presence of this short influence
radius, and a framework to quantify an individual’s influence in real OSNs. We then propose a
scalable optimization algorithm to identify the most influential spreaders. The time complexity
of our algorithm is independent of network size, and its performance is remarkably close to the true
optimum. Our method may be applied to other large-scale spreading problems, such as worldwide
epidemic control.


* Electronic address: yanqing.hu.sc@gmail.com 
† Electronic address: jinyuliang@gmail.com
Introduction— Modern online social platforms are replacing traditional media for the
spreading of information and the communication of opinions [6-9]. A common feature of today’s
online social networks (OSNs) is their gigantic size: for example, as of the second
quarter of 2015, there were about 1.5 billion monthly active users on Facebook. Notably,
multiplicative explosions of some information may take place at a global scale in such gigantic
OSNs, which is the foundation of viral marketing strategies [1]. Because of this,
information spreading is traditionally believed to be a global process. Indeed, most heuristic
measures, such as k-shell [2], betweenness [11], closeness [12], and the Katz index, evaluate
the influence of nodes based on knowledge of global network structures. In general, these
methods become impractical for giant OSNs, because either the full network structural data
are unavailable, or the computational time is non-scalable. On the other hand, based on
massive social experiments, Christakis and Fowler proposed the so-called three degrees of
influence (TDI) theory, which states that any individual’s social influence ceases beyond
three degrees (friends’ friends’ friends), and therefore suggests a local effect. A recent study
also shows that a local approximation works fairly well for the global measure of collective
influence [15]. The above situation presents a seeming paradox, which inspires us to ask a
fundamental question: is there a local aspect hidden in the overall global nature of spreading
processes?

Here we discover this hidden local aspect and provide a theoretical explanation for its
origin, in both random networks and real OSNs (see Methods). We find ubiquitously bimodal
behavior of any stochastic spreading process described by the Susceptible-Infected-Recovered
(SIR) model [4, 16-18] (see Methods): the information either spreads virally to a large set
of nodes, whose fraction (with respect to the total number of nodes in the network) is of
order one (Fig. 1A and 1C), or diminishes quickly beyond a local length scale (the influence
radius). The viral and local phases are unambiguously separated, and it can be shown that
the local phase plays the key role in the quantification of influence. Indeed, we show that
the influence, or the spreading power, of any node can be quantified from the purely local network
structure within the influence radius. Mapping the SIR model to bond percolation [19]
further reveals the analogy between the influence radius and the correlation length in critical
phenomena, which provides a physical understanding for the existence of this local length
scale.

Spreading power— To quantify the influence of an arbitrary node $i$, we define the single
node spreading power $S(i)$ as the average number of active nodes,

$$ S(i) = \sum_{s=1}^{N} s\, g(i,s), \qquad (1) $$

where $g(i,s)$ is the probability that a total of $s$ nodes are eventually activated by node $i$
in a network of $N$ nodes. We find that the distribution function $g(i,s)$ has two important
features: (i) It consists of two peaks, which correspond to local and viral spreadings
respectively. The local peak is located at small $s$, while the viral peak is centred at significantly
larger $s$ (Fig. 1A and 1C). Furthermore, the viral peak is $\delta$-function-like, and its location is
independent of node $i$ and of the stochastic realization (SI Sec. II). (ii) The two peaks
are separated by a wide gap, which implies that one may introduce a filter parameter to
distinguish them. The two features are explored below within the framework of percolation
theory.

We first map the SIR spreading to a bond percolation [4, 17], where every link (bond) has
a probability $\beta$ to exist and $1-\beta$ to not exist (see SI Sec. III for the more general case where $\beta$
is link-dependent). The final network forms many connected clusters of different sizes. It has
been proven that the probability distribution function $g(i,s)$ in Eq. (1) is exactly equivalent
to the percolation size distribution function $p(i,s)$, where $s$ is the size of the cluster that
the node $i$ belongs to [20]. According to percolation theory, a giant component of size
$s^\infty$ emerges above the percolation transition threshold $\beta_c$ (see Fig. 1B), where $p(i,s)$ splits
into a finite (non-giant) part $p_f(i,s)$ and a giant part $p(i,s^\infty)$ (see Fig. 1C and SI Sec. II).
The size $s^\infty$ is proportional to $N$, and depends on $\beta$. Because $\sum_s s\, p_f(i,s) \ll s^\infty p(i,s^\infty)$,
we may approximate Eq. (1) as

$$ S(i) \approx s^\infty p(i,s^\infty), \qquad (2) $$

where $s^\infty = \sum_{i=1}^{N} p(i,s^\infty)$. In other words, the spreading power is the product of the giant
component size and the probability that this node is in the giant component. In random
networks, theoretical formulas for $S(i)$ are obtained (see SI Sec. IV and V). Our calculation
yields $p(i,s^\infty) = 1-(1-q)^{k_i}$, where $k_i$ is the degree of node $i$, and
$s^\infty = N \sum_{k=1}^{\infty} P(k)\,[1-(1-q)^{k}]$, where $q$ is the probability of a random link to be
connected to the giant component, and is determined from the self-consistent equation
$q = \beta \sum_{k=1}^{\infty} \frac{k P(k)}{\langle k \rangle}\,[1-(1-q)^{k-1}]$, with
average degree $\langle k \rangle$ and degree distribution $P(k)$ [21].
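The mapping between SIR spreading and bond percolation can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: it assumes the network is given as a Python dictionary `adj` mapping each node to a list of neighbours, and the function names `cluster_size` and `spreading_power` are hypothetical. Each link is sampled lazily with probability `beta`, and the mean cluster size over many realizations estimates the sum in Eq. (1).

```python
import random
from collections import deque


def cluster_size(adj, seed, beta, rng):
    """Size of the bond-percolation cluster that contains `seed`.

    Links are kept with probability `beta`. A link is sampled the first
    time the search tries to cross it, and it is never tried again,
    which mirrors the one-shot activation attempt of the SIR model.
    """
    visited = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and rng.random() < beta:
                visited.add(v)
                queue.append(v)
    return len(visited)


def spreading_power(adj, seed, beta, runs=1000, rng=None):
    """Monte Carlo estimate of S(i): the mean cluster size of node `seed`."""
    rng = rng or random.Random(0)
    total = sum(cluster_size(adj, seed, beta, rng) for _ in range(runs))
    return total / runs
```

Collecting the individual values returned by `cluster_size` in a histogram, instead of averaging them, gives an empirical estimate of the distribution $p(i,s)$.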
Next, we show that the wide gap between the two peaks in $p(i,s)$ (Fig. 1C) can be used
to distinguish between viral and local spreadings. In any SIR process, once the number of
activated nodes reaches a threshold parameter $m$, the simulation can be terminated, since
the process is then most likely to become viral. We thus obtain a second approximate
form of the node spreading power, the truncated spreading power,

$$ \tilde{S}(i) = \tilde{s}^\infty\, \tilde{p}(i,\tilde{s}^\infty), \qquad (3) $$

where $\tilde{p}(i,\tilde{s}^\infty) = \sum_{s=m}^{N} p(i,s)$ and
$\tilde{s}^\infty = \sum_{i=1}^{N} \tilde{p}(i,\tilde{s}^\infty)$. It turns out that percolation theory
provides a fundamental approach to determine $m$. In fact, according to the theory, the
distribution $p_f(i,s)$ has a fast-decaying exponential tail, $p_f(i,s) \sim e^{-s/s^*}$, where $s^*$ gives
a characteristic size of the finite components [22-24]. For any $m > s^*$, the error introduced in
$S(i)$ by truncating this tail is small (see Fig. 2A for a comparison between $\tilde{S}(i)$ and $S(i)$).
Figure 2B shows that the relative error $E(i,m) = [\tilde{S}(i) - S(i)]/S(i)$ decays quickly with $m$, and becomes
negligible when $m > s^*$ (see SI Sec. IV and V for a theoretical calculation of the error in
random networks). The collapse of the tails of $E(i,m)$, which is essentially related to the
integral of $p_f(i,s)$, further reveals that $s^*$ is independent of $i$. The characteristic size $s^*$ has
an important application: once it is determined either theoretically or numerically, it can
be used as a threshold for the parameter $m$. As long as $m$ is chosen to be above $s^*$ (see SI
Sec. VI), the truncated spreading power $\tilde{S}(i)$ is a good approximation for $S(i)$, and its error
is well controlled.
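The early termination at the threshold $m$ can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' program: the network is a dictionary `adj` of neighbour lists, a run counts as viral when at least `m` nodes are activated (following the sum from $s = m$ in the definition of the truncated probability), and the names `reaches_m` and `truncated_probability` are hypothetical.

```python
import random
from collections import deque


def reaches_m(adj, seed, beta, m, rng):
    """Return True as soon as the spread from `seed` activates m nodes.

    The search stops early, so at most m nodes are ever explored and
    the cost of one run does not depend on the network size.
    """
    if m <= 1:
        return True
    visited = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and rng.random() < beta:
                visited.add(v)
                if len(visited) >= m:
                    return True
                queue.append(v)
    return False


def truncated_probability(adj, seed, beta, m, runs=1000, rng=None):
    """Fraction of runs in which the spread from `seed` reaches m nodes."""
    rng = rng or random.Random(0)
    hits = sum(reaches_m(adj, seed, beta, m, rng) for _ in range(runs))
    return hits / runs
```

Multiplying this probability by an estimate of the giant component size gives the truncated spreading power of Eq. (3).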

The average of $p(i,s)$ gives the global cluster distribution function $p(s) = \frac{1}{N}\sum_{i=1}^{N} p(i,s)$.
Its finite part $p_f(s)$ has the same tail as that of $p_f(i,s)$ (see Fig. 1D),

$$ p_f(s) \sim s^{-\tau}\, e^{-s/s^*}, \qquad (4) $$

which can be used to obtain $s^*$ theoretically in random networks. For example, in an
Erdos-Renyi (ER) network, $s^*$ can be obtained in closed form (see SI Sec. IV). An expansion of
this expression around the percolation transition $\beta_c$ gives the critical scaling $s^* \sim |\beta-\beta_c|^{-1/\sigma}$,
with the mean-field exponent $\sigma = 0.5$. For real OSNs, $s^*$ is obtained by fitting the simulation
data to the exponential tail in Eq. (4) (see Fig. 1D and SI Sec. VII). The inset of Fig. 1D shows
that $s^*$ in the real Facebook OSN also satisfies the critical power-law scaling.
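As a worked illustration of how a characteristic size and its mean-field scaling can arise in an ER network, the sketch below uses the standard Borel form for the size distribution of the finite cluster containing a random node in a Poisson random graph. It is a textbook-style derivation, written under the assumption that $c = \beta\langle k\rangle$ is the mean degree after percolation (so that $\beta_c = 1/\langle k\rangle$); it is not necessarily the expression given in SI Sec. IV, and the convention for the power-law prefactor may differ from that of Eq. (4).

```latex
\begin{align}
p_f(s) &= \frac{e^{-cs}\,(cs)^{s-1}}{s!}, \qquad c = \beta\langle k\rangle, \\
p_f(s) &\simeq \frac{1}{c\sqrt{2\pi}}\; s^{-3/2}\, e^{-(c-1-\ln c)\,s}
        \qquad (s \gg 1,\ \text{Stirling's formula}), \\
s^* &= \frac{1}{c-1-\ln c} \;\simeq\; \frac{2}{(c-1)^2}
     = \frac{2}{\langle k\rangle^{2}\,(\beta-\beta_c)^{2}}
     \qquad (c \to 1).
\end{align}
```

The divergence $s^* \propto |\beta-\beta_c|^{-2}$ obtained in this sketch matches the mean-field exponent $\sigma = 0.5$ quoted in the text.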

To reveal the topological meaning of the characteristic size $s^*$, we define an influence
radius $\ell^*$ associated with $s^*$. We perform SIR simulations until $s^*$ nodes are activated and assign
the maximum distance between the seed and active nodes, averaged over all realizations and
nodes, to be the influence radius $\ell^*$. For a typical $\beta$ such that $s^\infty = 0.3N$, we find that
$\ell^* \sim 3-4$ in all OSNs studied, which is significantly smaller than the network diameter (see
Fig. 2C). This result shows that if an SIR spreading is local, it vanishes within three
to four steps; otherwise, it spreads to about $s^\infty = 0.3N$ nodes. Note that $\ell^*$ increases
when $\beta \to \beta_c$ (see Fig. 2D); its scaling is discussed in SI Sec. IV. This behavior is
analogous to the critical phenomena of a continuous phase transition: at the critical point, the
correlation length diverges, but as soon as the system moves away from the critical point, a
characteristic scale appears.
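The measurement of the influence radius can be sketched in a few lines. The code below is a hedged illustration, not the authors' procedure: it assumes that the distance of an active node is counted in hops along the spreading process (the breadth-first depth in the percolated network), that runs which die out before reaching $s^*$ nodes are included in the average, and that the network is a dictionary `adj` of neighbour lists. The names `radius_until` and `influence_radius` are hypothetical.

```python
import random
from collections import deque


def radius_until(adj, seed, beta, s_star, rng):
    """Largest hop distance from `seed` among the first s_star active nodes."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue and len(depth) < s_star:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth and rng.random() < beta:
                depth[v] = depth[u] + 1
                queue.append(v)
                if len(depth) >= s_star:
                    break
    return max(depth.values())


def influence_radius(adj, beta, s_star, runs=1000, rng=None):
    """Average of radius_until over random seed nodes and realizations."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    total = sum(
        radius_until(adj, rng.choice(nodes), beta, s_star, rng)
        for _ in range(runs)
    )
    return total / runs
```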

The above analysis resolves the seeming paradox: while information spreading is in general
a global process due to viral spreading, the influence of any node
essentially depends only on its local network environment. An important product of this
finding is a scalable algorithm to identify the best spreaders, which is detailed below.

Algorithm— We aim to find the best $M$ spreaders $V = \{v_1, v_2, \ldots, v_M\}$ from a given
set $W$ of $L$ candidates with average properties (see SI Sec. VIII), such that the solution
maximizes the collective spreading power $S(V) = \sum_{s=1}^{N} s\, p(V,s)$, where $p(V,s)$ is the
probability that a total of $s$ nodes are activated by the $M$ spreaders. In the same spirit as
the single node spreading power $\tilde{S}(i)$, we introduce a truncated collective spreading power
$\tilde{S}(V) = \tilde{s}^\infty\, \tilde{p}(V,\tilde{s}^\infty)$, where $\tilde{p}(V,\tilde{s}^\infty)$ is the probability that at least one cluster of $m$ nodes is
activated by the $M$ spreaders (see SI Sec. II, IV, and V). While the computation time for
$S(V)$ increases linearly with $N$, it becomes $N$-independent for $\tilde{S}(V)$.

We propose a percolation-based greedy algorithm (PBGA) (see Methods and SI Sec. IX)
that adapts the natural greedy algorithm (NGA) [1]. As expected, simulation results show
that the computational time of PBGA is almost independent of network size $N$ (Fig. 3).
This reduction becomes significant for a world-wide OSN. In the SI (Table S1), we summarize
the theoretical time complexities of PBGA, NGA, and other widely used algorithms,
including brute-force search (BFS), maximum degree (MD), maximum k-shell (MKS) [2],
genetic algorithm (GA), maximum betweenness (MB) [11], maximum closeness (MC) [12],
maximum Katz (MK) index, eigenvector method (EM), and maximal collective influence
(MCI) [15]. Among these algorithms, only PBGA, MD and MCI have $N$-independent
theoretical computational complexities.

We next quantify the algorithm performance by comparing the collective spreading power
$S(V)$ of the solution set $V$ from different algorithms (Fig. 4A). Our results show that
for the entire range of studied $M$, the three algorithms PBGA, NGA and GA have the
best performance. Remarkably, the three algorithms give solutions indistinguishable from
the true optimum when $M$ is small (Fig. 4A inset). In fact, we conjecture that for
any $M$, the solution of PBGA should be nearly optimal. To support this, we analyze
two lower bounds of the performance ratio $PR = S(V)/S(V^*)$ between the PBGA performance
$S(V)$ and the exact optimal performance $S(V^*)$ (see Fig. 4B): (i) a combined bound
$PR_{\rm comb}$, given by the maximum of two rigorous bounds, which involve the set $W$ of $M$ nodes
with the maximum individual probability $p(i,s^\infty)$, and (ii) an approximated bound $PR_{\rm approx}$
that becomes rigorous in random networks (see SI Sec. X). The two rigorous bounds
work well in the small and large $M$ limits respectively, where they
both approach one, and the approximated bound $PR_{\rm approx} \approx 1$ for any value of $M$ considered.
Considering the above analysis, we argue that PBGA gives a nearly optimal solution
for an arbitrarily given number $M$ of spreaders.

Because the characteristic size $s^*$ grows when $\beta$ approaches $\beta_c$, the reduction of the
computational time of PBGA becomes smaller. However, we show that the solution
obtained by PBGA at a fixed $\beta_0$ ($\beta_0 > \beta_c$) can nevertheless be applied to any $\beta$ ($\beta > \beta_c$; see
SI Sec. XI for the case $\beta < \beta_c$). Indeed, if $V_0$ is the solution obtained by PBGA at $\beta_0$, then
the performance $S(V_0)$ measured at any other $\beta$ is equivalent to $S(V_\beta)$, as long as $\beta > \beta_0$,
where $V_\beta$ is the solution obtained at $\beta$ (see Fig. 4C). On the other hand, the performance
$S(V_0)$ worsens when $\beta < \beta_0$ and decreases towards the critical point $\beta_c$. However, even
in the worst case studied (when $\beta \sim \beta_c$), $S(V_0)$ is still about 60% of $S(V_\beta)$ (see Fig. 4C).
Our result suggests that the influence of nodes mainly depends on the network structure,
rather than on the dynamical parameter $\beta$. When the actual value of $\beta$ is unknown in real
problems, the PBGA could still give a reasonably good solution, as long as $\beta_0$ is chosen to
be greater than and not too close to $\beta_c$.

In this letter, we show from first principles that any node’s influence can be quantified
purely from its local network environment, based on the nature of the spreading dynamics.
Our approach is distinct from other local attempts, which usually use some distance
truncation strategies to approximate a global measure. For example, while the collective
influence is in principle a global measure, it could also be well approximated using the local
network structure within a distance parameter $\ell$ (see SI Sec. IX). Our findings here
explain why such a local approximation works well, and furthermore, our influence radius
$\ell^*$ may provide the optimal value of this local distance parameter $\ell$ [27].


Summary of methods 

Data sets- We investigate nine real-world data sets (see SI Sec. I for details). The undirected
OSNs include (1) the friendship network of New Orleans, Louisiana Facebook (NOLA
Facebook) users, (2) the arXiv collaboration network of General Relativity and Quantum
Cosmology (CA-GrQc), (3) the arXiv collaboration network of High Energy Physics Theory
(CA-HepTh), (4) the Email-Enron network, (5) the DBLP collaboration network of computer
science, (6) the QQ friendship network, and (7) the LiveJournal social network. The
directed OSNs include (8) the friendship network of Macau Weibo users, and (9) the network
of Delicious users.

Dynamical model- The SIR model was initially introduced in the context of disease
spreading, and has been supported by many recent empirical observations and
social experiments [30]. The process starts with an initial set of active spreaders. At each
time step, the active nodes attempt to activate their inactive neighbours with a constant
probability $\beta$. The attempt is only carried out once; the active nodes then enter the recovered
state, where they cannot make further attempts to activate their inactive neighbours in
subsequent steps.
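The dynamics described above can be written as a short discrete-time simulation. This is a minimal sketch under stated assumptions, not the authors' C# program: the network is a dictionary `adj` that lists the neighbours of every node, node states are labelled "S" (inactive), "I" (active) and "R" (recovered), and the function name `simulate_sir` is hypothetical.

```python
import random


def simulate_sir(adj, seeds, beta, rng):
    """Run one SIR realization and return the number of activated nodes.

    Every active node gets a single chance to activate each inactive
    neighbour with probability `beta`, then it recovers for good.
    """
    state = {u: "S" for u in adj}
    active = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    for u in active:
        state[u] = "I"
    total = len(active)
    while active:
        newly_active = []
        for u in active:
            for v in adj[u]:
                if state[v] == "S" and rng.random() < beta:
                    state[v] = "I"
                    newly_active.append(v)
            state[u] = "R"  # the attempt is carried out only once
        total += len(newly_active)
        active = newly_active
    return total
```

Averaging the returned value over many realizations started from the same seed gives a direct estimate of the spreading power defined in Eq. (1).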

Percolation-based greedy algorithm - The main strategy of PBGA is as follows: we (i)
first find the best spreader $v_1$ with the maximal individual spreading power $\tilde{S}(v_1)$ (in this
study, we use $m = 2s^*$ in the calculation of $\tilde{S}$), (ii) then fix $v_1$ and find the second best
spreader $v_2$ that maximizes the collective spreading power $\tilde{S}(V)$ for $V = \{v_1, v_2\}$, and (iii)
repeat this process $M$ times until $M$ spreaders $V = \{v_1, v_2, \ldots, v_M\}$ are selected. As a greedy
algorithm, the PBGA maximizes the marginal gain in the objective function $\tilde{S}(V)$ at each
step. Note that replacing the objective function by the real spreading power in the above
procedure would essentially recover the NGA (see SI Sec. IX for more details).
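The greedy strategy can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation. It assumes that the truncated giant component size does not depend on the seed set, so that maximizing the truncated collective spreading power amounts to maximizing the probability that at least one seed sits in a cluster of at least `m` nodes. It also assumes a dictionary `adj` of neighbour lists with mutually comparable node labels (for example integers). The names `hits_giant`, `p_tilde` and `pbga` are hypothetical.

```python
import random
from collections import deque


def hits_giant(adj, seeds, beta, m, rng):
    """One realization: does any seed belong to a cluster of >= m nodes?"""
    open_link = {}  # link states are sampled once and shared by all seeds

    def is_open(u, v):
        key = (u, v) if u <= v else (v, u)
        if key not in open_link:
            open_link[key] = rng.random() < beta
        return open_link[key]

    for seed in seeds:
        visited = {seed}
        queue = deque([seed])
        while queue:
            if len(visited) >= m:
                return True  # early stop: at most m nodes explored per seed
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited and is_open(u, v):
                    visited.add(v)
                    queue.append(v)
        if len(visited) >= m:
            return True
    return False


def p_tilde(adj, seeds, beta, m, runs, rng):
    """Monte Carlo estimate of the truncated probability for a seed set."""
    return sum(hits_giant(adj, seeds, beta, m, rng) for _ in range(runs)) / runs


def pbga(adj, candidates, M, beta, m, runs=200, seed=0):
    """Greedily pick M spreaders that maximize the truncated objective."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(M):
        best, best_score = None, -1.0
        for c in candidates:
            if c in chosen:
                continue
            score = p_tilde(adj, chosen + [c], beta, m, runs, rng)
            if score > best_score:
                best, best_score = c, score
        if best is None:  # fewer candidates than requested spreaders
            break
        chosen.append(best)
    return chosen
```

Because every search stops after at most `m` nodes, the cost of one evaluation in this sketch depends on `m`, the number of seeds and the local degrees, but not on the total number of nodes.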
FIG. 1: Mapping information spread to percolation. (A) Examples of simulated local (1,3,5)
and viral (2,4,6) SIR spreadings in the NOLA Facebook network ($\beta = 0.02$, $\beta_c = 0.01$). We start
the simulation from a randomly chosen node (red, $k = 27$). The active and non-active nodes
are colored in orange and white respectively. (B) An illustration of giant (left) and finite (right)
clusters in a bond percolation process. (C) The spreading probability distribution $g(i,s)$ (columns)
is plotted together with the cluster size distribution function $p(s)$ (line) obtained from percolation.
Note that $p(s)$ is the average of $g(i,s)$ over all nodes. In this example, we use the same seed node $i$
as in (A), but other randomly chosen nodes give similar bimodal distributions, with the same viral
peak at $s^\infty$ (see SI Sec. II). (D) The finite part $p_f(s)$ (points) of $p(s)$ is fitted to Eq. (4) (black
solid line) to obtain the characteristic size $s^* = 32.9 \pm 0.6$ and the exponent $\tau = 2.50$. (inset)
The characteristic size $s^*$ is fitted to a power-law divergence near the critical point $\beta_c$, with a
non-mean-field exponent $\sigma = 1.05$. The same network and the same $\beta$ are used in (A-D).
FIG. 2: Spreading power. (A) Comparison between the truncated spreading power $\tilde{S}(i)$ (Eq. (3))
($m = 2s^*$) and the exact spreading power $S(i)$ (Eq. (1)) in NOLA Facebook and Macau Weibo
($\beta_c = 0.05$) networks, where each point represents one node. (B) The $m$-dependence of the relative
error $E(i,m)$ of nodes whose degrees are equal to the average degree $\langle k \rangle$, in the NOLA Facebook
($\beta = 0.02$) network. (inset) The $m$-dependence of the relative error $E(i,m)$ of nodes with different
degrees. The relative error decreases quickly with $m$ and becomes smaller than 1% when $m > s^*$.
(C) Comparison of the influence radius $\ell^*$ and the network diameter $D$ in nine OSNs, and two
random networks (an ER network with $N = 50000$, $\langle k \rangle = 10$, and a scale-free (SF) network with
$N = 50000$ and a power-law degree distribution $P(k)$). We choose $\beta$ in different networks such that
the fraction of the giant component is the same, i.e., $s^\infty = 0.3N$. (D) The NOLA Facebook influence
radius $\ell^*$ is smaller than the network diameter for any $\beta > \beta_c$.
FIG. 3: Algorithm time complexity. (A) Comparison of the computational time of PBGA
and NGA in ER networks with $\beta = 0.2$ ($\beta_c = 0.1$). The algorithms select the set of $M = 10$
most influential nodes out of $L = 1000$ candidates with degree equal to $\langle k \rangle = 10$. The computational
time is fitted to $T(N) \sim N^{\theta}$ seconds (lines), with $\theta = 1.17$ for NGA and $\theta = 0.2$ for PBGA. (The
simulations are performed using C# programs on a server with 2.40 GHz Intel(R) Xeon(R) CPUs.
We expect that the exponent $\theta$ depends only weakly on the machine and the computer program.) (B)
Comparison of the computation time (rescaled by $\langle k \rangle$ and $m$) of PBGA and NGA in real OSNs
(open symbols, from left to right: CA-GrQc, CA-HepTh, Macau Weibo, Email-Enron, NOLA
Facebook, DBLP, Delicious, QQ, and LiveJournal). The value of $\beta$ is chosen such that $s^\infty = 0.3N$
in each OSN. The best fits give $\theta = 1.16$ for NGA and $\theta = 0.12$ for PBGA. Extrapolating the
fits to the global Facebook ($\approx$ 1.5 billion users) and Twitter ($\approx$ 320 million users) networks
gives the expected computational times (filled symbols) for NGA (Facebook $\approx$ 77 years, Twitter
$\approx$ 11 years) and for PBGA (Facebook $\approx$ 3.3 minutes, Weibo $\approx$ 2.8 minutes).
FIG. 4: Algorithm performance. (A) Comparison of the algorithm performance $S(V)$ (normalized
by a common factor $s^\infty$) of different algorithms (see text) in the NOLA Facebook ($\beta = 0.012$)
network, for different numbers of seeds $M$, with $L = 100$ candidates. (inset) The regime $1 < M < 6$
is enlarged, where the rigorous optimum obtained from BFS is available. (B) The combined rigorous
lower bound $PR_{\rm comb}$ and the approximated lower bound $PR_{\rm approx}$, plotted together with
the submodular lower bound $PR_{\rm submod} = 1 - 1/e \approx 0.63$ [1], as functions of $M$. (C) The PBGA solution
$V_0$ obtained at $\beta_0$ (filled symbols) is applied to other $\beta$ values, and its performance $S(V_0)$ is
compared to the performance $S(V_\beta)$ of the solution $V_\beta$ obtained at $\beta$.
Supplementary Information


I. DATA SETS 

In this study, nine sets of real-world online social networks (OSNs) are used to represent 
both undirected and directed networks. 

A. Undirected OSNs 


• CA-GrQc: The collaboration network of arXiv General Relativity and Quantum Cosmology
(CA-GrQc) is downloaded from the Stanford Large Network Dataset Collection
(SLNDC, http://snap.stanford.edu/data/). Two authors (nodes) are connected if
they have co-authored at least one paper. There are in total 5242 nodes with average
degree 5.53.

• CA-HepTh: The collaboration network of arXiv High Energy Physics-Theory (CA-HepTh)
is obtained from SLNDC. It contains 9877 nodes and each node has 5.26
edges on average.

• Email-Enron: The Enron email communication network is downloaded from
SLNDC. Nodes in this network are email addresses. If two email addresses have at
least one communication between them, they are connected by an edge. This network
has 36692 nodes with average degree 10.02.

• NOLA Facebook: The Facebook friendship network is obtained from the Online
Social Networks Research @ The Max Planck Institute for Software Systems
(http://socialnetworks.mpi-sws.org/data-wosn2009.html). It contains 63731
users (nodes) in New Orleans, Louisiana (NOLA), with average number of friends
(average degree) 25.64.

• DBLP: The DBLP co-authorship network of computer scientists is obtained from
SLNDC. It contains 317080 nodes with average degree 6.62.

• QQ: Tencent QQ is a popular messaging software service in China. Two QQ
users are connected if they are friends. The QQ network [35] has 1113435 nodes with
average degree 14.41.

• LiveJournal: The LiveJournal network [34] is obtained from SLNDC. It is an undirected
social network with 3997962 nodes and average degree 17.35.


B. Directed OSNs 


• Macau Weibo: Weibo, also known as Sina Weibo (http://www.weibo.com), is the
largest microblog platform in China. Similar to Twitter, Weibo users are connected
to their followers through directed links. The network data of Weibo users in Macau
were crawled in October 2012. The network contains 24023 nodes, with 7.73
followers on average.

• Delicious: Delicious.com is a website for storing, sharing and discovering web bookmarks.
The network of Delicious subscribers [36] has 582377 nodes with average degree
2.90.


II. SPREADING POWER 

In this section, we give mathematical definitions of spreading powers. In Sec. IV and
Sec. V we explain how to calculate these spreading powers theoretically in infinitely
large undirected and directed random networks.


A. Node spreading power 


1. Undirected networks 


As discussed in the main text, the spreading power $S(i)$ of node $i$ is defined as

$$ S(i) = \sum_{s=1}^{N} s\, p(i,s), \qquad (5) $$

where $p(i,s)$ is the probability that node $i$ belongs to a connected component with size $s$
in the bond percolation (recall that $g(i,s) = p(i,s)$, see Eq. (1) in the main text). For any
node $i$, $p(i,s)$ ubiquitously consists of two peaks, corresponding to the local and giant
components respectively (see Fig. 5). In large networks ($N \to \infty$), non-giant components
are negligible and the definition Eq. (5) is equivalent to

$$ S(i) = s^\infty p(i,s^\infty), \qquad (6) $$

where $s^\infty$ is the size of the giant component in the super-critical phase ($\beta > \beta_c$), and $p(i,s^\infty)$
is the probability of node $i$ belonging to the giant component. Note that when $N \to \infty$,
the giant component is unique (see Sec. II A 3), and therefore the giant component size
$s^\infty(i) = s^\infty$ is independent of $i$. When $N$ is finite, $s^\infty(i)$ is narrowly distributed (see Fig. 5),
and we approximate this narrow distribution as a $\delta$-distribution. The giant component size
$s^\infty$ is related to $p(i,s^\infty)$ as

$$ s^\infty = \sum_{i=1}^{N} p(i,s^\infty). \qquad (7) $$

To reduce the time complexity in computation, we further define a truncated spreading
power to approximate Eq. (6),

$$ \tilde{S}(i) = \tilde{s}^\infty\, \tilde{p}(i,\tilde{s}^\infty), \qquad (8) $$

where

$$ \tilde{p}(i,\tilde{s}^\infty) = \sum_{s=m}^{N} p(i,s) = 1 - \sum_{s=1}^{m-1} p(i,s), \qquad (9) $$

and

$$ \tilde{s}^\infty = \sum_{i=1}^{N} \tilde{p}(i,\tilde{s}^\infty). \qquad (10) $$

Here we use a parameter $m$ to distinguish a giant component from a non-giant component.
The role of $m$ is discussed in detail in the main text. In practice, we use the Monte Carlo
method to evaluate the summation in Eq. (10). We calculate the average over $N_{\rm sample}$ randomly
picked nodes to sample the real average over the $N$ total nodes. For NOLA Facebook and Macau
Weibo networks, Fig. 6 shows that $N_{\rm sample} = 500$ Monte Carlo samples are sufficient to have
an accurate estimate of $\tilde{s}^\infty$. Fig. 7 shows that the truncated estimate $\tilde{s}^\infty$ agrees well
with the exact $s^\infty$.

2. Directed networks 

A percolation process on directed networks forms (i) a giant strongly connected component
(GSCC) in which every node can be reached from every other, (ii) a giant-in component
(GIN) in which every node can reach the GSCC, and (iii) a giant-out component (GOUT)
FIG. 5: (A-B) Two examples (see Fig. 1C in the main text for another example) of $g(i,s)$
($g(i,s) = p(i,s)$) in the NOLA Facebook network. The $g(i,s)$ for arbitrary $i$ ubiquitously has
two peaks, which can be separated by $s^*$. (C) The giant component sizes $s^\infty(i)$ of individual nodes
are narrowly distributed. Similar data are plotted in the second and third rows for an Erdos-Renyi
(ER) network (D-F, $N = 50000$, $\langle k \rangle = 2$, $\beta = 0.6$) and a random scale-free (SF) network (G-I,
$N = 50000$, power-law $P(k)$, $\beta = 0.15$).


in which every node can be reached from the GSCC, see Ref. [37] and Fig. 8. The intersection
of the GIN and the GOUT gives the GSCC. Any node $i$ in the GIN would eventually
influence every node in the GOUT (note that a node in the GOUT does not necessarily
influence the GIN), and therefore the spreading power of node $i$ should be the product of
the size of the GOUT and the probability $p^{\rm in}(i,s^\infty_{\rm in})$ of it being in the GIN:

$$ S(i) = s^\infty_{\rm out}\, p^{\rm in}(i,s^\infty_{\rm in}), \qquad (11) $$
FIG. 6: Estimate of $\tilde{s}^\infty$ (points) using the Monte Carlo method, with $m = 50$, $\beta = 0.08$ for Macau
Weibo, and $\beta = 0.02$ for NOLA Facebook. The error bar represents the standard error of the
mean, calculated from 100 independent simulations. The exact $s^\infty$ (lines) is the average value of
the giant component size, obtained from 10000 bond percolation realizations without using the
cutoff $m$.

FIG. 7: The approximated giant component size $\tilde{s}^\infty$ (points, $m = 50$) is compared with the
exact $s^\infty$ (lines), for different values of $\beta$, in NOLA Facebook and Macau Weibo
networks. In the Macau Weibo (directed) network, both GIN and GOUT sizes are plotted.

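where

$$ s^\infty_{\rm out} = \sum_{i=1}^{N} p^{\rm out}(i,s^\infty_{\rm out}), \qquad (12) $$

with $s^\infty_{\rm in}$ and $s^\infty_{\rm out}$ being the sizes of GIN and GOUT.

The truncated spreading power is defined as

$$ \tilde{S}(i) = \tilde{s}^\infty_{\rm out}\, \tilde{p}^{\rm in}(i,\tilde{s}^\infty_{\rm in}), \qquad (13) $$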
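where

$$ \tilde{p}^{\rm in}(i,\tilde{s}^\infty_{\rm in}) = \sum_{s_{\rm in}=m_{\rm in}}^{N} p^{\rm in}(i,s_{\rm in}) = 1 - \sum_{s_{\rm in}=1}^{m_{\rm in}-1} p^{\rm in}(i,s_{\rm in}), \qquad (14) $$

$$ \tilde{p}^{\rm out}(i,\tilde{s}^\infty_{\rm out}) = \sum_{s_{\rm out}=m_{\rm out}}^{N} p^{\rm out}(i,s_{\rm out}) = 1 - \sum_{s_{\rm out}=1}^{m_{\rm out}-1} p^{\rm out}(i,s_{\rm out}), \qquad (15) $$

$$ \tilde{s}^\infty_{\rm out} = \sum_{i=1}^{N} \tilde{p}^{\rm out}(i,\tilde{s}^\infty_{\rm out}). \qquad (16) $$

Here, in principle we could have two cut-off values $m_{\rm in}$ and $m_{\rm out}$. In this study, we consider
the simplified case where we set $m = m_{\rm in} = m_{\rm out}$.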
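For directed networks, the truncated quantities can be estimated with forward and backward searches. The sketch below is one possible reading of Eqs. (13)-(16), not the authors' code. It assumes that `succ[u]` lists the nodes that `u` can activate, that a node counts as belonging to the GIN when its forward spread reaches at least `m` nodes, and that a node counts as belonging to the GOUT when a backward search over reversed links reaches at least `m` nodes. The function names are hypothetical.

```python
import random
from collections import deque


def reach_at_least(neigh, start, beta, m, rng):
    """True if a percolation search from `start` along `neigh` reaches m nodes."""
    if m <= 1:
        return True
    visited = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neigh.get(u, ()):
            if v not in visited and rng.random() < beta:
                visited.add(v)
                if len(visited) >= m:
                    return True
                queue.append(v)
    return False


def reverse(succ):
    """Build the predecessor lists of a directed network."""
    pred = {u: [] for u in succ}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, []).append(u)
    return pred


def truncated_power_directed(succ, i, beta, m, runs, n_sample, rng):
    """Estimate of (truncated GOUT size) x (probability that i is in the GIN)."""
    pred = reverse(succ)
    nodes = list(pred)
    # Forward spread from i: probability that i reaches at least m nodes.
    p_in = sum(reach_at_least(succ, i, beta, m, rng) for _ in range(runs)) / runs
    # Backward searches from sampled nodes: fraction of nodes in the GOUT.
    hits = sum(
        reach_at_least(pred, rng.choice(nodes), beta, m, rng)
        for _ in range(n_sample)
    )
    s_out = len(nodes) * hits / n_sample
    return s_out * p_in
```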






FIG. 8: An illustration of GIN and GOUT components in a directed network. 


3. Uniqueness of the giant component size 

Here we show that, in a percolated random network, there should be only one unique
giant component, whose size is $s^\infty$. Therefore, for each node $i$, its $s^\infty(i)$ in Eq. (6) should
be identical to $s^\infty$, i.e., $s^\infty(i) = s^\infty$ for any $i$.

In the supercritical phase $\beta > \beta_c$, the ratio $s^\infty/N$ is finite, while all other components
have zero fractions. When $N \to \infty$, $s^\infty$ diverges with $N$. If there are two giant components,
whose sizes are $s^\infty_1$ and $s^\infty_2$, then both $s^\infty_1/N$ and $s^\infty_2/N$ are non-zero. Let $k_i$ be the degree of
an arbitrary node $i$ in the first giant component. For each link connected to $i$, the probability
that it is not connected to the second giant component is $1 - \frac{s^\infty_2}{N}$. The probability that node
$i$ is not connected to the second giant component is
$\left(1 - \frac{s^\infty_2}{N}\right)^{k_i} \le 1 - \frac{s^\infty_2}{N}$. The probability
that every node in the first giant component is not connected to the second one is less than
$\left(1 - \frac{s^\infty_2}{N}\right)^{s^\infty_1}$, which is zero since $1 - \frac{s^\infty_2}{N} < 1$ and $s^\infty_1 \to \infty$. Thus it is impossible to find
two disconnected giant components. In other words, there should be only one unique giant
component if it exists.
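The last step of this argument can be made explicit with the elementary inequality $1 - x \le e^{-x}$. The sketch below assumes, as in the argument above, that both hypothetical giant components occupy finite fractions of the network; the symbols $a_1$ and $a_2$ for these fractions are introduced here only for illustration.

```latex
\begin{align}
a_1 &= \frac{s^\infty_1}{N} > 0, \qquad a_2 = \frac{s^\infty_2}{N} > 0, \\
\left(1 - a_2\right)^{s^\infty_1} &\le e^{-a_2 s^\infty_1}
  = e^{-a_1 a_2 N} \xrightarrow[N \to \infty]{} 0 .
\end{align}
```

The probability of finding two mutually disconnected giant components therefore vanishes exponentially fast in $N$ under these assumptions.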


B. Set spreading power 


1. Undirected networks 


For a given set $V = \{v_1, v_2, \ldots, v_M\}$ of $M$ nodes, the joint spreading power of the set $V$
is defined as

$$ S(V) = \sum_{s=1}^{N} s\, p(V,s), \qquad (17) $$

where $p(V,s)$ is the probability of a total of $s$ nodes being influenced by the set $V$. If we
neglect the non-giant components, then Eq. (17) becomes

$$ S(V) = s^\infty p(V,s^\infty), \qquad (18) $$

where $p(V,s^\infty)$ is the probability that at least one node in set $V$ belongs to the giant
component. The truncated set spreading power is defined as

$$ \tilde{S}(V) = \tilde{s}^\infty\, \tilde{p}(V,\tilde{s}^\infty), \qquad (19) $$

where $\tilde{p}(V,\tilde{s}^\infty)$ is the probability that at least one node in $V$ belongs to a component whose
size is greater than $m$. In practice, for each spreading simulation, one only needs to find the
largest component influenced by any node in set $V$. If the size of this component is greater
than $m$, then it is considered as the giant component, and the seeds in set $V$ will influence
the whole giant component.

2. Directed networks 

For directed networks, Eq. (18) becomes

$$ S(V) = s^\infty_{\rm out}\, p^{\rm in}(V,s^\infty_{\rm in}), \qquad (20) $$

where $p^{\rm in}(V,s^\infty_{\rm in})$ is the probability that at least one node in set $V$ belongs to the GIN, and
the truncated set spreading power becomes

$$ \tilde{S}(V) = \tilde{s}^\infty_{\rm out}\, \tilde{p}^{\rm in}(V,\tilde{s}^\infty_{\rm in}), \qquad (21) $$

where $\tilde{p}^{\rm in}(V,\tilde{s}^\infty_{\rm in})$ is the probability that at least one node in $V$ belongs to an in-component
whose size is greater than $m_{\rm in}$.

III. FAMILY OF SUSCEPTIBLE-INFECTED-RECOVERED (SIR) MODELS 

In the main text, we focus on the basic SIR model with a constant spreading rate $\beta$.
The basic SIR model may be generalized to a family of models, where the probability $\beta$ is
link-dependent. For example, we could assume $\beta$ to be dependent on the degrees of nodes $i$
and $j$ connected to the link,

$$ \beta_{(i,j)} = \frac{b}{(k_i + k_j)^{\alpha}}, \qquad (22) $$

where $b$ and $\alpha$ are two parameters. It is straightforward to generalize our formulas and
algorithms to the generalized SIR model, by substituting $\beta$ with $\beta_{(i,j)}$. For example, Fig. 9
shows that the truncated spreading power $\tilde{S}(i)$ also agrees well with $S(i)$ for the generalized
SIR model, in both NOLA Facebook and Macau Weibo networks.




FIG. 9: The spreading powers in a generalized SIR model with the link-dependent spreading rate
$\beta_{(i,j)}$ (see Eq. (22)). The truncated spreading power $\tilde{S}(i)$ is consistent with the exact spreading
power $S(i)$ in the NOLA Facebook network with $b = 0.4$, $\alpha = 1$ and $m = 50$, and in the Macau
Weibo network with $b = 1$, $\alpha = 1$ and $m = 50$.


IV. THEORY FOR UNDIRECTED RANDOM NETWORKS 

In this section, we use percolation theory to calculate the spreading power, the error
of the truncated spreading power, the characteristic size $s^*$, and the influence radius $\ell^*$, in
undirected random networks.


A. Spreading power 


We first derive theoretical formulas for $S(i)$ (Eq. (6)) in an undirected random network
with arbitrary degree distribution $P(k)$. Because random networks have locally tree-like
structures, the spreading power of a node $i$ depends only on its degree $k_i$. Let $q$ be the
probability that a randomly chosen link belongs to the giant component; then we can
write

$$p(i, s^\infty) = 1 - (1-q)^{k_i}. \qquad (23)$$

Here the term $(1-q)^{k_i}$ is the probability that none of the node's links is connected to
the giant component. Thus $1 - (1-q)^{k_i}$ is the probability that at least one of its links is
connected to the giant component, i.e., the probability that the node is connected to the
giant component. According to Eqs. (7) and (23), the giant component size is

$$s^\infty = \sum_{i=1}^{N} p(i, s^\infty) = N \sum_{k=1}^{\infty} P(k)\left[1 - (1-q)^{k}\right]. \qquad (24)$$

For the set spreading power,

$$p(V, s^\infty) = 1 - (1-q)^{k_V}, \qquad (25)$$

where $k_V = k_{v_1} + \cdots + k_{v_M}$.

A similar argument can be applied to the link probability $q$. If a link is connected to a
node with degree $k$, then this link is connected to the giant component when at least one
of the remaining $k-1$ links is connected to the giant component [38, 39],

$$q = \beta \sum_{k} \frac{k P(k)}{\langle k \rangle}\left[1 - (1-q)^{k-1}\right]. \qquad (26)$$

Here $\langle k \rangle$ is the average degree, $k P(k)/\langle k \rangle$ is the probability that a randomly chosen link leads to
a node with degree $k$, and $[1 - (1-q)^{k-1}]$ is the probability that at least one of the other
$k-1$ neighbours is connected to the giant component. The probability $q$ of a randomly
selected link belonging to the giant component is the probability that at least one of the
other $k-1$ neighbours is connected to the giant component, multiplied by the probability
$\beta$ that this selected link is occupied in the percolation process.
We summarize how to calculate the theoretical spreading power in random networks: for
given $P(k)$ and $\beta$, we first solve Eq. (26) numerically to obtain $q$, then plug its value into
Eqs. (23) and (24) to get $p(i, s^\infty)$ and $s^\infty$, and finally use Eq. (6) to obtain the spreading
power $S(i)$. A similar strategy can be used to calculate the set spreading power $S(V)$.
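The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the degree distribution is passed as a hypothetical dictionary `pk` mapping degree to probability, Eq. (26) is solved by plain fixed-point iteration, and the spreading power is taken as $S(i) = s^\infty p(i, s^\infty)$, i.e. the single-node case of the set formula. The function names are ours, not the authors'.

```python
def solve_q(pk, beta, tol=1e-12, max_iter=100000):
    """Solve Eq. (26) for q by fixed-point iteration.

    pk maps degree k -> P(k) (assumed normalized); beta is the spreading rate.
    """
    mean_k = sum(k * p for k, p in pk.items())
    q = beta  # non-zero start, so the iteration does not sit on the trivial root q = 0
    for _ in range(max_iter):
        q_new = beta * sum(
            k * p / mean_k * (1.0 - (1.0 - q) ** (k - 1))
            for k, p in pk.items() if k > 0
        )
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q


def spreading_power(k_i, pk, beta, n_nodes):
    """Return (S(i), q, s_inf) for a node of degree k_i.

    Assumes S(i) = s_inf * p(i, s_inf), with Eqs. (23) and (24).
    """
    q = solve_q(pk, beta)
    p_i = 1.0 - (1.0 - q) ** k_i                                              # Eq. (23)
    s_inf = n_nodes * sum(p * (1.0 - (1.0 - q) ** k) for k, p in pk.items())  # Eq. (24)
    return p_i * s_inf, q, s_inf
```

For a 3-regular random graph, `pk = {3: 1.0}`, Eq. (26) reduces to $q = \beta[1 - (1-q)^2]$, which can be solved by hand and used to check the iteration.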


B. Generating functions 


To evaluate the error of the truncated spreading power $\tilde{S}(i)$ and the characteristic
component size $s^*$, we use the generating function method [22]. For an undirected random network,
the generating function $g_0(x)$ for the degree distribution $P(k)$ is

$$g_0(x) = \sum_{k} P(k)\, x^{k}, \qquad (27)$$

and the generating function $g_1(x)$ related to the branching process is

$$g_1(x) = \sum_{k} \frac{k P(k)}{\langle k \rangle}\, x^{k-1}. \qquad (28)$$

For a bond percolation (spreading) process with $\beta \in [0,1]$, the generating functions become

$$G_0(x) = g_0[1 - \beta(1-x)], \qquad (29)$$

and

$$G_1(x) = g_1[1 - \beta(1-x)]. \qquad (30)$$

According to Ref. [22], we also obtain the generating function $H(x)$ for the component
size $s$,

$$H(x) = x\, G_0[Q(x)], \qquad (31)$$

where

$$Q(x) = x\, G_1[Q(x)]. \qquad (32)$$

We use Eqs. (30) and (32) to solve $Q(x)$ numerically for any given $\beta$ and $P(k)$, and
use Eq. (31) to obtain $H(x)$.

The expansion of $H(x)$ gives the probability $p(s)$ that a randomly selected node
belongs to a component of size $s$,

$$H(x) = {\sum_{s=1}^{s^\infty}}' \, p(s)\, x^{s}, \qquad (33)$$

where the prime indicates that the summation does not include the upper limit $s^\infty$. In the sub-critical phase
$\beta < \beta_c$, $H(1) = 1$; in the super-critical phase $\beta > \beta_c$, $1 - H(1) = s^\infty/N$, which means that
the probability that a randomly selected node belongs to the giant component is $s^\infty/N > 0$. As in Ref. [22], we use the
Cauchy formula and contour (residue) integration to calculate $p(s)$:

$$p(s) = \frac{1}{s!} \left. \frac{d^{s} H(z)}{dz^{s}} \right|_{z=0} = \frac{1}{2\pi i} \oint \frac{H(z)}{z^{s+1}}\, dz, \qquad (34)$$

where $H(z)$ is an analytic continuation of $H(x)$ to a complex domain.
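One way to evaluate Eq. (34) numerically is sketched below. It is an assumption-laden illustration, not the authors' code: the contour is a circle of hypothetical radius 0.9, $Q(z)$ is obtained from Eq. (32) by fixed-point iteration starting from zero (which converges for the Poisson example used here, but is not guaranteed for every degree distribution), and the integral is approximated by the trapezoidal rule.

```python
import cmath
import math


def component_size_distribution(G0, G1, s_max, radius=0.9, n_points=256, n_iter=200):
    """Return [p(1), ..., p(s_max)] from H(z) = z*G0(Q(z)), Q(z) = z*G1(Q(z)).

    G0 and G1 are callables accepting complex arguments (Eqs. (29) and (30)).
    """
    samples = []
    for idx in range(n_points):
        z = radius * cmath.exp(2j * math.pi * idx / n_points)
        Q = 0j
        for _ in range(n_iter):          # fixed-point iteration for Eq. (32)
            Q = z * G1(Q)
        samples.append(z * G0(Q))        # Eq. (31)

    probs = []
    for s in range(1, s_max + 1):
        # Trapezoidal rule for (1 / 2 pi i) * contour integral of H(z) / z^(s+1)
        total = sum(
            h * cmath.exp(-2j * math.pi * idx * s / n_points)
            for idx, h in enumerate(samples)
        )
        probs.append((total / n_points).real / radius ** s)
    return probs


# Example: ER network with <k> = 4 and beta = 0.3, so G0 = G1 = exp(beta*<k>*(x - 1)).
lam = 0.3 * 4
G = lambda x: cmath.exp(lam * (x - 1))
p = component_size_distribution(G, G, s_max=10)
```

For the Poisson case the result can be checked against the known closed form $p(s) = e^{-\lambda s}(\lambda s)^{s-1}/s!$ with $\lambda = \beta\langle k \rangle$.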

The generating function $H(i,x)$ of an individual node $i$ can also be obtained. If we
randomly keep a fraction $\beta$ of the links connected to node $i$, then the generating function of its
degree distribution is $[1 - \beta(1-x)]^{k_i}$. Similarly, the generating function for the size of the
component containing node $i$ is

$$H(i,x) = x \left\{1 - \beta[1 - Q(x)]\right\}^{k_i}, \qquad (35)$$

which can be used to get the probability distribution $p(i,s)$,

$$H(i,x) = {\sum_{s=1}^{s^\infty}}' \, p(i,s)\, x^{s}. \qquad (36)$$


C. Error of truncated spreading power 


In this subsection, we show how to evaluate the error of the truncated spreading powers
$\tilde{S}(i)$ and $\tilde{S}(V)$ in an undirected random network. According to the definition of the truncated
spreading power (see Eq. (19) for the set version), there are two sources of error, $\tilde{s}^\infty$ and
$\tilde{p}(i, \tilde{s}^\infty)$ (or $\tilde{p}(V, \tilde{s}^\infty)$), which are discussed in detail below.

We first discuss the error of the giant component size $\tilde{s}^\infty$. Because we count any non-giant
component with size $s > m$ as the giant component, an error is introduced. The absolute
error $\epsilon^a_\infty(m) = (\tilde{s}^\infty - s^\infty)/N$ is

$$\epsilon^a_\infty(m) = {\sum_{s=m}^{s^\infty}}' \, p(s) = H(1) - \sum_{s=1}^{m-1} p(s), \qquad (37)$$

where we have used $s^\infty = N[1 - H(1)]$, see Ref. [22]. In practice, $H(1), p(1), p(2), \cdots, p(m)$
are calculated numerically using the generating function method discussed in Sec. IV B. We
further define a relative error as

$$\epsilon^r_\infty(m) = \frac{H(1) - \sum_{s=1}^{m-1} p(s)}{1 - H(1)}. \qquad (38)$$
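Next we consider the error of the probabilities $\tilde{p}(i, \tilde{s}^\infty)$ and $\tilde{p}(V, \tilde{s}^\infty)$. The error of $\tilde{p}(i, \tilde{s}^\infty)$
comes from the over-estimate due to non-giant components with sizes larger than $m$, as can be
seen from Eq. (9). Thus the absolute error $e^a(i,m) = \tilde{p}(i, \tilde{s}^\infty) - p(i, s^\infty)$ is

$$e^a(i,m) = {\sum_{s=m}^{s^\infty}}' \, p(i,s) = H(i,1) - \sum_{s=1}^{m-1} p(i,s), \qquad (39)$$

where $p(i,s)$ is calculated using the Cauchy formula and contour integration, as for $p(s)$ in
Eq. (34). Because $p(i, s^\infty) = 1 - H(i,1)$, the relative error of $\tilde{p}(i, \tilde{s}^\infty)$ is

$$e^r(i,m) = \frac{\tilde{p}(i, \tilde{s}^\infty) - p(i, s^\infty)}{p(i, s^\infty)} = \frac{H(i,1) - \sum_{s=1}^{m-1} p(i,s)}{1 - H(i,1)}. \qquad (40)$$

Since individual nodes are uncorrelated in random networks, the truncated probability
$\tilde{p}(V, \tilde{s}^\infty)$ in Eq. (19) is

$$\tilde{p}(V, \tilde{s}^\infty) = 1 - \prod_{i \in V}\left[1 - \tilde{p}(i, \tilde{s}^\infty)\right] = 1 - \prod_{i \in V}\left[1 - \sum_{s \ge m} p(i,s)\right], \qquad (41)$$

where the sum over $s \ge m$ includes the giant component. Here $\prod_{i \in V}[1 - \sum_{s \ge m} p(i,s)]$ is the
probability that all nodes in set $V$ belong to components
with sizes smaller than $m$. Thus $\tilde{p}(V, \tilde{s}^\infty)$ is the probability that at least one node belongs
to a component larger than size $m$. According to Eq. (41), the absolute error of $\tilde{p}(V, \tilde{s}^\infty)$ is

$$e^a(V,m) = \tilde{p}(V, \tilde{s}^\infty) - p(V, s^\infty)
= \left\{1 - \prod_{i \in V}\left[\sum_{s=1}^{m-1} p(i,s)\right]\right\} - \left\{1 - \prod_{i \in V}\left[1 - p(i, s^\infty)\right]\right\}
= \prod_{i \in V}\left[1 - p(i, s^\infty)\right] - \prod_{i \in V}\left[\sum_{s=1}^{m-1} p(i,s)\right]. \qquad (42)$$

When $V = \{i\}$, we recover the single node error Eq. (39).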

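With the above two errors obtained, we now write down the relation between $\tilde{S}(i)$ and
$S(i)$ for a given $m$,

$$\tilde{S}(i) = S(i)\left[1 + e^r(i,m)\right]\left[1 + \epsilon^r_\infty(m)\right], \qquad (43)$$

which gives the total absolute error $E^a(i,m) = \tilde{S}(i) - S(i)$, and the total relative error
$E^r(i,m) = [\tilde{S}(i) - S(i)]/S(i)$. This result can also be generalized to the set spreading power,

$$E^r(V,m) = \frac{\left[p(V, s^\infty) + e^a(V,m)\right]\left[s^\infty + N \epsilon^a_\infty(m)\right]}{p(V, s^\infty)\, s^\infty} - 1. \qquad (44)$$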
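As a small numerical illustration of Eqs. (37) and (38), the sketch below evaluates the giant-component-size errors for an ER network in the super-critical phase. It relies on two assumptions that are not taken from the text above: the finite component sizes of a Poisson network follow the Borel form $p(s) = e^{-\lambda s}(\lambda s)^{s-1}/s!$ with $\lambda = \beta\langle k \rangle$, and $H(1)$ is the smallest root of $u = e^{\lambda(u-1)}$. The function names are hypothetical.

```python
import math


def borel_pmf(s, lam):
    """p(s) for a Poisson (ER) network; evaluated in log space to avoid overflow."""
    return math.exp(-lam * s + (s - 1) * math.log(lam * s) - math.lgamma(s + 1))


def truncation_errors(lam, m, n_iter=10000):
    """Return (absolute, relative) error of the truncated giant component size.

    lam = beta * <k> must be larger than 1 (super-critical phase).
    """
    u = 0.0
    for _ in range(n_iter):              # H(1) = u solves u = exp(lam * (u - 1))
        u = math.exp(lam * (u - 1.0))
    if 1.0 - u < 1e-12:
        raise ValueError("no giant component: relative error is undefined")
    head = sum(borel_pmf(s, lam) for s in range(1, m))
    eps_abs = u - head                   # Eq. (37)
    eps_rel = eps_abs / (1.0 - u)        # Eq. (38)
    return eps_abs, eps_rel
```

Both errors decrease monotonically with $m$, which is the property used later to choose the filter parameter.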
D. Characteristic component size


Here we discuss how to calculate the theoretical value of $s^*$ in random undirected networks
using the generating function approach. Using the formula of the distribution function of
finite (non-giant) component sizes [22],

$$P_f(s) = c\, s^{-\tau} e^{-s/s^*}, \qquad (45)$$

where $c$ is a constant, $H(x)$ can be written as (see Eq. (33))

$$H(x) = c \sum_{s \ge 1} s^{-\tau} e^{-s/s^*} x^{s}. \qquad (46)$$

Note that here we use $P_f(s)$ to denote the distribution of components with finite sizes, i.e.,
$P_f(s) = p(s)$ for $s < s^\infty$, where $p(s)$ is the full distribution function including the giant
component $s^\infty$. The full distribution is always normalized to unity, $\sum_{s=1}^{s^\infty} p(s) = 1$. When
$\beta > \beta_c$, $P_f(s)$ is not normalized to unity, $\sum_{s} P_f(s) < 1$, because the giant component
has a non-zero weight. Fig. 10 shows that Eq. (45) works well for random networks and the
real NOLA Facebook network.

The parameter $s^*$ is related to the radius of convergence $|x^*|$ of the generating function
$H(x)$ as (see Ref. [22]):

$$|x^*| = \lim_{s \to \infty} \frac{P_f(s)}{P_f(s+1)} = e^{1/s^*}. \qquad (47)$$

Thus we have

$$s^* = \frac{1}{\ln |x^*|}. \qquad (48)$$

The radius of convergence $|x^*|$ is equal to the distance between the nearest singularity of
$H(z)$ and the origin on the complex plane. It can be calculated by first numerically solving
$w^*$ from the equation

$$G_1(w^*) - w^* G_1'(w^*) = 0, \qquad (49)$$

where $G_1'(x)$ is the first derivative of $G_1(x)$, and then plugging $w^*$ into the relation

$$|x^*| = \frac{w^*}{G_1(w^*)}. \qquad (50)$$

For ER networks with average degree $\langle k \rangle$, we have $G_1(x) = G_0(x) = e^{\beta \langle k \rangle (x-1)}$ (see Ref. [22]),
which gives

$$w^*_{\rm ER} = \frac{1}{\beta \langle k \rangle}. \qquad (51)$$
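According to Eqs. (50) and (51), we obtain

$$|x^*_{\rm ER}| = \frac{e^{\beta \langle k \rangle - 1}}{\beta \langle k \rangle}, \qquad (52)$$

and

$$s^*_{\rm ER} = \frac{1}{\beta \langle k \rangle - 1 - \ln \beta - \ln \langle k \rangle}. \qquad (53)$$

Expanding around $\beta_c = 1/\langle k \rangle$ gives the critical scaling,

$$s^*_{\rm ER} \sim |\beta - \beta_c|^{-2}. \qquad (54)$$

The exponent $-2$ is consistent with the standard mean-field result. Note that this scaling
applies to any random network.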
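The procedure of Eqs. (48)-(50) can be sketched as follows. This is a minimal illustration assuming that $G_1$ and its derivative are supplied as Python callables and that the root of Eq. (49) lies in a hypothetical bracket $[0, w_{\rm hi}]$; bisection is our choice of root finder, not necessarily the authors'.

```python
import math


def characteristic_size(G1, dG1, w_hi=100.0, n_steps=200):
    """Return s* = 1 / ln|x*| using Eqs. (48)-(50).

    G1, dG1: callables for G_1(w) and its first derivative.
    The root w* of G1(w) - w*G1'(w) = 0 is assumed to lie in (0, w_hi).
    """
    f = lambda w: G1(w) - w * dG1(w)
    lo, hi = 0.0, w_hi
    if f(lo) <= 0.0 or f(hi) >= 0.0:
        raise ValueError("root of Eq. (49) is not bracketed by [0, w_hi]")
    for _ in range(n_steps):             # bisection for Eq. (49)
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    w_star = 0.5 * (lo + hi)
    x_star = w_star / G1(w_star)         # Eq. (50)
    return 1.0 / math.log(x_star)        # Eq. (48)


# ER example with <k> = 4 and beta = 0.3: G1(w) = exp(beta*<k>*(w - 1)).
lam = 0.3 * 4
s_star_er = characteristic_size(
    lambda w: math.exp(lam * (w - 1.0)),
    lambda w: lam * math.exp(lam * (w - 1.0)),
)
```

For this ER example the result can be compared with the closed form of Eq. (53), $1/(\beta\langle k \rangle - 1 - \ln \beta\langle k \rangle)$, which is about 56.6 as quoted in Fig. 10(A).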

The above analysis gives $s^*$ in $P_f(s)$ for the whole network. Any individual node
$i$ has its own size distribution function for the component containing $i$. Its
corresponding characteristic value $s^*(i)$ is determined by the singularity of $H(i,z)$ nearest to the origin. It
can be shown that

$$s^*(i) = s^*, \qquad (55)$$

because the singularities in $H(z)$ and $H(i,z)$ are determined by the same function $Q(z)$,
according to Eqs. (31) and (35).


E. Influence radius 


We consider the average number $z_l$ of $l$th-nearest neighbors,

$$z_l = \left(\frac{z_2}{z_1}\right)^{l-1} z_1, \qquad (56)$$

which is used to estimate the influence radius $\ell^*$,

$$1 + \sum_{l=1}^{\ell^*} \beta^{l} z_l = s^*. \qquad (57)$$

This gives an expression for $\ell^*$,

$$\ell^* = \frac{\ln\left[(s^* - 1)(\beta z_2 - z_1) + \beta z_1^2\right] - \ln(\beta z_1^2)}{\ln(\beta z_2 / z_1)}. \qquad (58)$$

Note that if we replace $s^*$ by $N$, and set $\beta = 1$, then we recover the expression for the
average path length in random networks (see Eq. (53) in Ref. [22]).
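Eq. (58) is simple enough to evaluate directly. The helper below is a hypothetical convenience function (its name and argument order are ours) that returns $\ell^*$ from $s^*$, $z_1$, $z_2$ and $\beta$; it assumes $\beta z_2 \neq z_1$ and a positive argument of the first logarithm.

```python
import math


def influence_radius(s_star, z1, z2, beta):
    """Influence radius from Eq. (58).

    s_star: characteristic component size; z1, z2: average numbers of first and
    second nearest neighbors; beta: spreading rate. Requires beta*z2 != z1.
    """
    numerator = math.log((s_star - 1.0) * (beta * z2 - z1) + beta * z1 ** 2) \
        - math.log(beta * z1 ** 2)
    return numerator / math.log(beta * z2 / z1)


# ER example: z1 = <k> = 4, z2 = <k>^2 = 16, beta = 0.3, s* = 56.6 (Fig. 10(A)).
ell = influence_radius(56.6, 4.0, 16.0, 0.3)
```

Because Eq. (57) is a geometric series, the result can be verified by checking that $1 + \beta z_1 (r^{\ell^*} - 1)/(r - 1)$ with $r = \beta z_2/z_1$ reproduces $s^*$.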
FIG. 10: The finite cluster size distribution function $P_f(s)$ in random and real undirected networks.
(A) The ER network with $\langle k \rangle = 4$, $\beta = 0.3$. The theoretical value of $s^*$ is 56.6, and the numerical
fitting to the distribution function $P_f(s)$ using Eq. (45) (solid line) gives $s^* = 61.4 \pm 0.6$ (vertical
dashed line). (B) The SF random network with P{k) ^ 7 < k < 100, and /3 = 0.1. The 

theoretical value is $s^* = 3.4$, and the numerical fitting (solid line) gives $s^* = 3.7 \pm 0.2$. The
red dashed lines in (A) and (B) are theoretical results obtained from Eq. (34). (C) The NOLA
Facebook network with $\beta = 0.02$. The numerical $s^* = 32.9 \pm 0.6$ is obtained from fitting (solid
line).


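For ER networks, $z_1 = \langle k \rangle$, $z_2 = \langle k(k-1) \rangle = \langle k \rangle^2$, and $\beta_c = 1/\langle k \rangle$, so Eq. (58) becomes

$$\ell^*_{\rm ER} = \frac{\ln\left[s^*_{\rm ER}(1 - \beta_c/\beta) + \beta_c/\beta\right]}{\ln(\beta/\beta_c)}, \qquad (59)$$

where $s^*_{\rm ER}$ is given in Eq. (53). To obtain the critical scaling of $\ell^*_{\rm ER}$, we expand $\beta$ around $\beta_c$
as $\beta = \beta_c(1 + \varepsilon)$, where $\varepsilon \ll 1$. Substituting this expansion in Eq. (59), together with the
critical scaling $s^*_{\rm ER} \sim \varepsilon^{-2}$, we obtain $\ell^*_{\rm ER} \sim -\ln|\varepsilon| / |\varepsilon|$. In the vicinity of $\beta_c$, $|\varepsilon|^{-1}$ diverges
much faster than $-\ln|\varepsilon|$. Thus the critical scaling of $\ell^*_{\rm ER}$ is

$$\ell^*_{\rm ER} \sim |\beta - \beta_c|^{-1}, \qquad (60)$$

which is exactly the same as the critical scaling of the correlation length in random
networks [37]. The scaling (60) is examined in Fig. 11 with an analysis of the finite size
effect.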
We have shown in Sec. IV D that $s^*(i)$ is independent of $i$. Based on this, it is easy
to show that the individual node's influence radius $\ell^*(i)$ is also independent of $i$. Indeed,
because the number of nodes grows exponentially with $\ell^*(i)$ (see Eq. (57)), and linearly with
FIG. 11: (A) The numerical data of $\ell^*_{\rm ER}$ with different $N$ are compared to the theoretical scaling
Eq. (60). The simulation results are obtained from ER networks with average degree $\langle k \rangle = 2$.
Because the ER network is small-world, the value of $\ell^*_{\rm ER}$ approaches a finite constant when
$\beta \to \beta_c$. (B) The numerical data of $\ell^*_{\rm ER}$ with different $\langle k \rangle$ ($N = 5000000$) are compared to the
theoretical scaling. (C) The average number of nodes $N(l)$ on the $l$th layer is plotted as a function
of $l$ (total $N = 5000000$, $\langle k \rangle = 2$). For small $l$, $N(l) \sim \langle k \rangle^{l}$ grows exponentially with $l$. Due to the
finite size effect, a deviation from this exponential growth begins around $l' = 18$ (vertical dashed line).
That is, when $l < l'$ the network obeys the scalings of infinitely large networks, and finite size
effects are not observable. When $\ell^*_{\rm ER} > l'$ (red cross in (A)), the numerical $\ell^*_{\rm ER}$ deviates from the
theoretical scaling and approaches a plateau. (D) Numerical data of $s^*_{\rm ER}$ vs. $\ell^*_{\rm ER}$ ($N = 5000000$,
$\langle k \rangle = 2$). Each data point corresponds to a different $\beta$. The data obey the theoretical scaling
$s^*_{\rm ER} \sim (\ell^*_{\rm ER})^2$ only when $\ell^*_{\rm ER} < l'$, consistent with (A) and (C).

the degree $k_i$, the fluctuation of $k_i$ can be neglected when $\ell^*(i)$ is large. In real networks, the
distribution of $\ell^*(i)$ is also narrowly peaked around its average $\ell^*$ (see Fig. 12).


FIG. 12: The histogram of the individual node's influence radius $\ell^*(i)$ in the NOLA Facebook
network, with $\beta = 0.0375$ and the average $\ell^* = 2.6$.

V. THEORY FOR DIRECTED RANDOM NETWORKS 

Many OSNs have directed links. For example, in Twitter, the followers of a user are not 
always followed by this user. In this section, we discuss how to generalize our theoretical 
analysis from undirected to directed networks. As in the case of undirected networks, our 
theory is developed based on random networks. 


A. Spreading power 


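In directed random networks, we use $P(k^i, k^o)$ to denote the joint degree distribution of
in-degrees and out-degrees. For any randomly chosen directed link, let $i$ be the node that
can be reached along the direction of the link. Similar to Eq. (26), the probability $q_{\rm in}$ of this
link being connected to the GIN satisfies a self-consistent equation:

$$q_{\rm in} = \beta \sum_{k^i, k^o} \frac{k^i P(k^i, k^o)}{\langle k^i \rangle}\left[1 - (1 - q_{\rm in})^{k^o}\right]
= \beta \sum_{k^o} P(k^o)\left[1 - (1 - q_{\rm in})^{k^o}\right], \qquad (61)$$

where, to obtain the second equality, we have used the property $P(k^i, k^o) = P(k^i)P(k^o)$,
and the definition of the average in-degree $\langle k^i \rangle = \sum_{k^i} k^i P(k^i)$.

Similar to Eq. (23), the probability of a randomly chosen node $i$ belonging to
the GIN is

$$p^{\rm in}(i, s^\infty_{\rm in}) = 1 - (1 - q_{\rm in})^{k^o_i}, \qquad (62)$$

and similar to Eq. (24), the GIN size can be written as

$$s^\infty_{\rm in} = N \sum_{k^i, k^o} P(k^i, k^o)\left[1 - (1 - q_{\rm in})^{k^o}\right]
= N \sum_{k^o} P(k^o)\left[1 - (1 - q_{\rm in})^{k^o}\right]. \qquad (63)$$

For a set $V$ of nodes, we have

$$p^{\rm in}(V, s^\infty_{\rm in}) = 1 - (1 - q_{\rm in})^{k^o_V}, \qquad (64)$$

where $k^o_V = k^o_{v_1} + k^o_{v_2} + \cdots + k^o_{v_M}$.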

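It is straightforward to generalize the theory to GOUT quantities, such as $q_{\rm out}$ and $s^\infty_{\rm out}$.
For any randomly chosen directed link, let $j$ be the node that can be reached along the
opposite direction of the link, and $q_{\rm out}$ be the probability that $j$ belongs to the GOUT. Then
we have

$$q_{\rm out} = \beta \sum_{k^i, k^o} \frac{k^o P(k^i, k^o)}{\langle k^o \rangle}\left[1 - (1 - q_{\rm out})^{k^i}\right]
= \beta \sum_{k^i} P(k^i)\left[1 - (1 - q_{\rm out})^{k^i}\right], \qquad (65)$$

and

$$s^\infty_{\rm out} = N \sum_{k^i, k^o} P(k^i, k^o)\left[1 - (1 - q_{\rm out})^{k^i}\right]
= N \sum_{k^i} P(k^i)\left[1 - (1 - q_{\rm out})^{k^i}\right]. \qquad (66)$$

Equations (62) and (66), together with the definition of the spreading power for directed
networks (cf. Eq. (20)), give the node spreading power $S(i)$. The theoretical predictions are
compared to numerical simulations in Fig. 13, showing good agreement.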
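A minimal sketch of the directed calculation is given below. It assumes uncorrelated in- and out-degrees, so that the second equalities in Eqs. (61) and (65) apply, and it takes $S(i) = s^\infty_{\rm out}\, p^{\rm in}(i, s^\infty_{\rm in})$ as the single-node case of Eq. (20). The dictionaries `p_in` and `p_out` (degree to probability) and the function names are illustrative assumptions.

```python
def solve_link_probability(p_deg, beta, n_iter=10000):
    """Fixed-point iteration for q = beta * sum_k P(k) [1 - (1 - q)^k]."""
    q = beta
    for _ in range(n_iter):
        q = beta * sum(p * (1.0 - (1.0 - q) ** k) for k, p in p_deg.items())
    return q


def directed_spreading_power(k_out_i, p_in, p_out, beta, n_nodes):
    """Spreading power of a node with out-degree k_out_i in a directed random network.

    p_in, p_out: marginal in- and out-degree distributions (assumed independent).
    """
    q_in = solve_link_probability(p_out, beta)     # Eq. (61)
    q_out = solve_link_probability(p_in, beta)     # Eq. (65)
    p_i = 1.0 - (1.0 - q_in) ** k_out_i            # Eq. (62)
    s_out = n_nodes * sum(                         # Eq. (66)
        p * (1.0 - (1.0 - q_out) ** k) for k, p in p_in.items()
    )
    return p_i * s_out
```

With every node having in- and out-degree 2 and $\beta = 0.8$, both self-consistent equations reduce to $q = 0.8[1 - (1-q)^2]$, which gives a convenient hand check.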


B. Generating functions 

For a directed random network, the generating function corresponding to the joint degree
distribution $P(k^i, k^o)$ is

$$g_0(x, y) = \sum_{k^i, k^o} P(k^i, k^o)\, x^{k^i} y^{k^o}, \qquad (67)$$
FIG. 13: The simulation (points) and theoretical (lines) results of $S(i)$ in ER and SF directed random
networks. The lines are theoretical results obtained using the formulas presented in Sec. V A.
The SF network has N = 50000, P{k^) ^ (A:°)“^-^ with 2 < < 100, and f3 — 0.4; the ER network 

has $\langle k^o \rangle = 25$, $N = 50000$ and $\beta = 0.08$.


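and the generating functions related to the branching process are

$$g_1^i(x) = \sum_{k^i, k^o} \frac{k^o P(k^i, k^o)}{\langle k^o \rangle}\, x^{k^i}, \qquad (68)$$

and

$$g_1^o(y) = \sum_{k^i, k^o} \frac{k^i P(k^i, k^o)}{\langle k^i \rangle}\, y^{k^o}, \qquad (69)$$

for in- and out-degrees respectively. Here $\langle k^i \rangle = \langle k^o \rangle$, since every in-degree for the
destination node is also the out-degree for the source node. For a bond percolation (spreading)
process, the corresponding generating functions become

$$G_0(x, y) = g_0[1 - \beta(1-x),\, 1 - \beta(1-y)], \qquad (70)$$

$$G_1^i(x) = g_1^i[1 - \beta(1-x)], \qquad (71)$$

and

$$G_1^o(y) = g_1^o[1 - \beta(1-y)]. \qquad (72)$$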

The generating function $H^{\rm out}(x)$ for the size of the component that can arrive at a randomly
selected node (including this node) is

$$H^{\rm out}(x) = x\, G_0[Q^{\rm out}(x), 1], \qquad (73)$$

where

$$Q^{\rm out}(x) = x\, G_1^i[Q^{\rm out}(x)] \qquad (74)$$

is the generating function of the nodes reached by the incoming links in the branching
process. Expansion of $H^{\rm out}(x)$ gives

$$H^{\rm out}(x) = {\sum_{s_{\rm out}=1}^{s^\infty_{\rm out}}}' \, p^{\rm out}(s_{\rm out})\, x^{s_{\rm out}}. \qquad (75)$$

Similarly, we obtain the generating functions $H^{\rm in}(x)$ and $Q^{\rm in}(x)$ as

$$H^{\rm in}(x) = x\, G_0[1, Q^{\rm in}(x)], \qquad (76)$$

and

$$Q^{\rm in}(x) = x\, G_1^o[Q^{\rm in}(x)]. \qquad (77)$$

The above formulas describe the global generating functions. One may also
derive generating functions for single nodes. In particular, for a given node $i$ with out-degree
$k^o_i$, after randomly keeping a fraction $\beta$ of its links, the generating function of its out-degree is
$[1 - \beta(1-x)]^{k^o_i}$. Thus the generating function $H^{\rm in}(i,x)$ for the size of the component that can be
reached from node $i$ (including this node) is

$$H^{\rm in}(i,x) = x \left\{1 - \beta[1 - Q^{\rm in}(x)]\right\}^{k^o_i}, \qquad (78)$$

which can be expanded as

$$H^{\rm in}(i,x) = {\sum_{s_{\rm in}=1}^{s^\infty_{\rm in}}}' \, p^{\rm in}(i, s_{\rm in})\, x^{s_{\rm in}}. \qquad (79)$$

C. Error of truncated spreading power 

In this subsection, we discuss how to estimate the error of the truncated spreading power $\tilde{S}(i)$
(see Eq. (21) with $V = \{i\}$), using generating functions. Compared with the analysis of undirected random
networks, here we have to derive the errors of $\tilde{p}^{\rm in}(i, \tilde{s}^\infty_{\rm in})$ and $\tilde{s}^\infty_{\rm out}$ separately, since they
correspond to different types of degrees.

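We first obtain the absolute error of the out-giant component size $\tilde{s}^\infty_{\rm out}$, which is

$$\epsilon^a_\infty(m_{\rm out}) = H^{\rm out}(1) - \sum_{s_{\rm out}=1}^{m_{\rm out}-1} p^{\rm out}(s_{\rm out}). \qquad (80)$$

Its relative error is

$$\epsilon^r_\infty(m_{\rm out}) = \frac{H^{\rm out}(1) - \sum_{s_{\rm out}=1}^{m_{\rm out}-1} p^{\rm out}(s_{\rm out})}{1 - H^{\rm out}(1)}. \qquad (81)$$

Next, we consider the absolute error of $\tilde{p}^{\rm in}(i, \tilde{s}^\infty_{\rm in})$,

$$e^a(i, m_{\rm in}) = {\sum_{s_{\rm in}=m_{\rm in}}^{s^\infty_{\rm in}}}' \, p^{\rm in}(i, s_{\rm in}) = H^{\rm in}(i,1) - \sum_{s_{\rm in}=1}^{m_{\rm in}-1} p^{\rm in}(i, s_{\rm in}), \qquad (82)$$

and its relative error,

$$e^r(i, m_{\rm in}) = \frac{\tilde{p}^{\rm in}(i, \tilde{s}^\infty_{\rm in}) - p^{\rm in}(i, s^\infty_{\rm in})}{p^{\rm in}(i, s^\infty_{\rm in})} = \frac{H^{\rm in}(i,1) - \sum_{s_{\rm in}=1}^{m_{\rm in}-1} p^{\rm in}(i, s_{\rm in})}{1 - H^{\rm in}(i,1)}. \qquad (83)$$

For the set spreading power

$$\tilde{p}^{\rm in}(V, \tilde{s}^\infty_{\rm in}) = 1 - \prod_{i \in V}\left[1 - \tilde{p}^{\rm in}(i, \tilde{s}^\infty_{\rm in})\right] = 1 - \prod_{i \in V}\left[1 - \sum_{s_{\rm in} \ge m_{\rm in}} p^{\rm in}(i, s_{\rm in})\right], \qquad (84)$$

where the sum over $s_{\rm in} \ge m_{\rm in}$ includes the giant in-component, we obtain the absolute error of $\tilde{p}^{\rm in}(V, \tilde{s}^\infty_{\rm in})$ as

$$e^a(V, m) = \prod_{i \in V}\left[1 - p^{\rm in}(i, s^\infty_{\rm in})\right] - \prod_{i \in V}\left[\sum_{s_{\rm in}=1}^{m-1} p^{\rm in}(i, s_{\rm in})\right]. \qquad (85)$$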


From the above analysis, the estimated spreading power of node $i$ can be written as

$$\tilde{S}(i, m) = S(i)\left[1 + e^r(i,m)\right]\left[1 + \epsilon^r_\infty(m)\right], \qquad (86)$$

where we have used the simplification $m = m_{\rm in} = m_{\rm out}$. Similar to the case of undirected
networks, the absolute and relative errors, $E^a(i,m) = \tilde{S}(i) - S(i)$ and $E^r(i,m) = [\tilde{S}(i) - S(i)]/S(i)$,
are calculated from the generating functions given in Sec. V B. This result can
also be generalized to the set spreading power:

$$E^r(V, m) = \frac{\left[p^{\rm in}(V, s^\infty_{\rm in}) + e^a(V,m)\right]\left[s^\infty_{\rm out} + N \epsilon^a_\infty(m)\right]}{p^{\rm in}(V, s^\infty_{\rm in})\, s^\infty_{\rm out}} - 1. \qquad (87)$$

D. Characteristic component size 


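Both $H^{\rm out}(x)$ and $H^{\rm in}(x)$ can be expanded similarly to Eq. (46):

$$H^{\rm out}(x) = c_{\rm out} \sum_{s_{\rm out} \ge 1} s_{\rm out}^{-\tau_{\rm out}}\, e^{-s_{\rm out}/s^*_{\rm out}}\, x^{s_{\rm out}}, \qquad (88)$$

and

$$H^{\rm in}(x) = c_{\rm in} \sum_{s_{\rm in} \ge 1} s_{\rm in}^{-\tau_{\rm in}}\, e^{-s_{\rm in}/s^*_{\rm in}}\, x^{s_{\rm in}}, \qquad (89)$$

where $c_{\rm out}$ and $c_{\rm in}$ are two constants. We use the same method as described in Eqs. (47)
and (50) to compute the theoretical $s^*_{\rm in}$ and $s^*_{\rm out}$ for random networks. For real networks,
we fit the data to the distributions $P_f^{\rm in}(s_{\rm in}) = c_{\rm in}\, s_{\rm in}^{-\tau_{\rm in}} e^{-s_{\rm in}/s^*_{\rm in}}$ and
$P_f^{\rm out}(s_{\rm out}) = c_{\rm out}\, s_{\rm out}^{-\tau_{\rm out}} e^{-s_{\rm out}/s^*_{\rm out}}$ (see Fig. 14).

Similar to Eq. (55), the characteristic component sizes for individual nodes are the same
as the global ones. Furthermore, when the in- and out-degrees are both Poisson distributed
with the same averages $\langle k^i \rangle = \langle k^o \rangle$, $s^*_{\rm in}$ and $s^*_{\rm out}$ also become identical, $s^*_{\rm in} = s^*_{\rm out}$.


FIG. 14: Fitting (solid lines) the component distribution functions gives (A) $s^*_{\rm in} = 24.0 \pm 0.8$ and
(B) $s^*_{\rm out} = 17.9 \pm 0.6$ (vertical dashed lines), in the Macau Weibo network with $\beta = 0.08$.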

VI. DETERMINING THE FILTER PARAMETER m FOR REAL OSNS 

Unlike random networks, real networks usually have more complicated structures and
cannot be analyzed theoretically. However, we can again use the existence of the
characteristic size $s^*$ to set a bound for the parameter $m$. The truncated spreading power is
considered to be a good approximation as long as $m > s^*$.

In some cases, the characteristic size $s^*$ is unavailable in real OSNs. In such a case,
one may determine $m$ using the following approach. Figure 15 shows that the cumulative
distribution function (CDF) $F(m) = \sum_{s < m} P_f(s)$ converges quickly to the fixed value of
$1 - s^\infty/N$. To estimate $F(m)$, one should carry out repeated spreading simulations from
random seed nodes, and find the probability that the influenced component size does
not reach $m$. A reasonable $m$ can be determined from the point at which $F(m)$ starts to
saturate to a constant, as shown in Fig. 15. This strategy requires little computational time,
and can be used as a quick trial when the value of $s^*$ is not available.
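The Monte Carlo estimate of $F(m)$ described above might look as follows. This is a sketch under assumptions: the network is given as a hypothetical adjacency dictionary `adj` (node to list of neighbours), spreading is simulated as a breadth-first SIR outbreak in which each link is tried once with probability $\beta$, and every outbreak is cut off at the largest $m$ of interest.

```python
import random
from collections import deque


def truncated_outbreak_size(adj, seed, beta, m, rng):
    """Size of the outbreak started at `seed`, stopped once m nodes are reached."""
    visited = {seed}
    queue = deque([seed])
    while queue and len(visited) < m:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and rng.random() < beta:
                visited.add(v)
                queue.append(v)
                if len(visited) >= m:
                    break
    return len(visited)


def estimate_cdf(adj, beta, m_values, n_samples=1000, rng=None):
    """Estimate F(m) = P(outbreak size < m) for each m in m_values."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    m_max = max(m_values)
    sizes = [
        truncated_outbreak_size(adj, rng.choice(nodes), beta, m_max, rng)
        for _ in range(n_samples)
    ]
    return {m: sum(size < m for size in sizes) / n_samples for m in m_values}
```

Plotting the returned values against $m$ and reading off where the curve flattens gives a candidate filter parameter; the cost per sample is bounded by the cut-off rather than by the network size.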


FIG. 15: Finding the filter parameter $m$ through Monte Carlo simulations for (A) NOLA Facebook,
(B) Macau Weibo in-components, and (C) Macau Weibo out-components. We perform Monte Carlo
simulations to calculate the cumulative probability distribution function $F(m)$. As the value of
$m$ increases, $F(m)$ grows and saturates to a fixed value. As seen from the plots, the saturation
occurs at values slightly larger than $s^*$ for both NOLA Facebook and Macau Weibo. In terms of
sample size, Monte Carlo results with 1000 samples are already sufficiently accurate, compared to
the exact integral of $P_f(s)$ (red line).


VII. DETERMINING THE CHARACTERISTIC COMPONENT SIZE IN REAL 
OSNS 

While we have discussed how to obtain the theoretical characteristic component size $s^*$
of random networks (Sec. IV D and Sec. V D), for real OSNs we have to determine $s^*$
numerically. We find that the distribution $P_f(s)$ generally has an exponential tail $e^{-s/s^*}$, as
described by Eq. (45), from which we obtain the fitted $s^*$ (see Fig. 16). The $P_f(s)$ of several
OSNs (such as Delicious, Email-Enron, and LiveJournal) has some small peaks at large $s$
(note that they are not the giant component). These peaks may be caused by the existence
of well-connected small communities, and are not considered in our fitting.
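A simple way to extract $s^*$ from a measured tail is a least-squares fit of $\ln P_f(s)$ against $s$. The sketch below makes a simplifying assumption that differs from Eq. (45): it ignores the power-law prefactor $s^{-\tau}$ and fits a pure exponential above a user-chosen cut-off `s_min` (a hypothetical parameter), so it should be read as a rough first estimate only.

```python
import math


def fit_s_star(sizes, probs, s_min):
    """Fit ln P_f(s) = a - s / s* on the tail s >= s_min; return s*.

    sizes, probs: component sizes and their measured probabilities.
    Points with zero probability are skipped.
    """
    points = [(s, math.log(p)) for s, p in zip(sizes, probs) if s >= s_min and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two tail points for the fit")
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((s - mean_s) * (y - mean_y) for s, y in points) \
        / sum((s - mean_s) ** 2 for s, _ in points)
    return -1.0 / slope
```

Isolated peaks at large $s$, such as those mentioned above, would bias this fit and should be excluded from the input before calling the function.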
FIG. 16: Determining $s^*$ ($s^*_{\rm in}$ and $s^*_{\rm out}$ for directed networks) in real OSNs. The tail of $P_f(s)$ is
fitted to the exponential form $e^{-s/s^*}$. The value of $\beta$ is chosen such that $s^\infty = 0.5N$ in each OSN.

VIII. OPTIMIZATION PROBLEM 

A generic version of the optimization problem is to identify a given number $M$ of best
spreaders from the whole set of network nodes [^. Solving this problem in typical OSNs is 
not too difficult (when the network size is not too large). Indeed, Fig. 17 shows that selecting the
nodes with maximum degrees works well in this case. Moreover, the relative performance
$S(V)/s^\infty$ converges quickly to one. When $M > 4$, the influence of $M$ hubs already covers
the entire giant cluster $s^\infty$. Thus, for this simple version of the optimization problem, there is
little need to develop more sophisticated optimization algorithms.

A modified version of the optimization problem is to identify the $M$ best spreaders from
a given set of candidates. A more formal definition is: from a network with $N$ nodes, we
first choose $L$ nodes as the candidate set $W$, from which we want to find the
set $V$ of $M$ nodes that have the maximal set spreading power $S(V)$. From the commercial
point of view, the most influential individuals are usually celebrities who are expensive to
target, and therefore it is more cost-effective to choose average users as candidates


m- 


In this study, we focus on this modified version, and choose nodes with average degrees as 
candidates. 
FIG. 17: Performances of several different algorithms for the original optimization problem, where
we aim to find the best $M$ spreaders from the whole set of nodes in the NOLA Facebook network
with $\beta = 0.012$.


IX. OPTIMIZATION ALGORITHMS 

A. Algorithms based on the spreading power 

There are many different types of algorithms to solve the optimization problem described
in Sec. VIII, ranging from simple algorithms based on heuristic measures to sophisticated
computer algorithms. Here we give a brief description of the algorithms used in this study. In
the discussion of the time complexity, we assume that the number $E$ of edges is of the same
order as the number of nodes $N$ in the network. Therefore we replace the variable $E$ with
$N$ in the formulas for computational complexity. We also generally ignore the dependence
of the computational complexity on other parameters that are not on the same scale as $N$,
like $M$, $L$, or other algorithm parameters.
1. Brute-force search (BFS)


From the set of $L$ candidate nodes, there are $\binom{L}{M}$ ways to choose a set of $M$ seed nodes.
For each selected set, Monte Carlo simulations are performed to estimate its spreading
power with computational complexity $O(N)$. Thus the overall complexity is $O(L^M N)$. The
BFS gives the exact solution, but its computational complexity is extremely high, especially
for large $L$ and $M$.

2. Natural greedy algorithm (NGA) 

The natural greedy algorithm is adopted from Ref. [1]. In brief, it proceeds as follows:

1. For every source node in the candidate set, we obtain its individual spreading power
$S(v_i)$ through repeated spreading simulations. The node with the largest individual
spreading power is chosen as the first selected node $v_1$.

2. Obtain the joint spreading power of each pair of nodes that contains $v_1$ and any other node
in the candidate set. Select the node with the highest joint spreading power $S(\{v_1, v_2\})$
as the second selection $v_2$. Note that in this step, we keep $v_1$ (from the previous step)
fixed.

3. Repeat the process by finding one additional node each time that maximizes the joint
spreading power with the already selected nodes, until $M$ nodes $\{v_1, \cdots, v_M\}$ are found.

The computational complexity of finding one more node in $\{v_1, \cdots, v_M\}$ is $O(N)$,
proportional to the total number of links/nodes in the network. The total computational
complexity is also $O(N)$.
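The three steps above can be sketched as follows. Several choices here are assumptions rather than the authors' implementation: the network is a hypothetical adjacency dictionary `adj`, the spreading power of a seed set is estimated as the mean outbreak size over `n_runs` SIR simulations (the objective used in standard greedy influence maximization), and ties are broken by candidate order.

```python
import random
from collections import deque


def outbreak_size(adj, seeds, beta, rng):
    """Number of nodes reached by one SIR outbreak started from all seeds."""
    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited and rng.random() < beta:
                visited.add(v)
                queue.append(v)
    return len(visited)


def set_spreading_power(adj, seeds, beta, n_runs, rng):
    """Monte Carlo estimate of the mean outbreak size for a seed set."""
    return sum(outbreak_size(adj, seeds, beta, rng) for _ in range(n_runs)) / n_runs


def natural_greedy(adj, candidates, n_select, beta, n_runs=100, rng=None):
    """Greedily add the candidate that maximizes the estimated joint spreading power."""
    rng = rng or random.Random(0)
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_select:
        best = max(
            remaining,
            key=lambda v: set_spreading_power(adj, selected + [v], beta, n_runs, rng),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each evaluation runs full outbreaks, so the cost of one greedy step grows with the network size, which is the $O(N)$ behaviour noted above.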

3. Percolation-based greedy algorithm (PBGA) 

The basic idea of our percolation-based algorithm is similar to that of the NGA (see
Methods in the main text). According to Eq. (13), we only need to consider the probability
$p(i, s^\infty)$ of the node being in the giant component, since the giant component size $s^\infty$ is a
constant. In practice, our algorithm is designed as follows. We first find the best spreader $v_1$
with the highest probability of being in the giant component. Then we find $v_2$ that has the highest
probability of being in the giant component given the condition that $v_1$ is not in the giant
component. In this way, the node $v_2$ gives the maximal gain of the conditional probability
that at least one of the nodes in $V = \{v_1, v_2\}$ belongs to the giant component. The procedure
is repeated until we find the set $V = \{v_1, v_2, \cdots, v_M\}$. Below we outline our algorithm:

1. Starting from $L$ candidate nodes (seeds), we simulate the spreading process by activating
their neighbours with probability $\beta$, as in the SIR model. The spreading of the $L$ seeds is
simulated simultaneously. We stop the spreading process from a source either when the
component containing the source reaches size $m$, or when the spreading dies out naturally
(see Fig. 18). The source belongs to the giant component in the former case, and to a
non-giant component in the latter case. Note that because we perform the simulations
of the $L$ seeds in parallel, the nodes in the component containing a certain source may be
activated from other seeds.

2. We repeat step (1) many times. Each time, we save the set of seeds that belong to the
giant component as $A_k$, where $k = 1, 2, \ldots, K$.

3. Extract the best spreader $v_1$ that appears most frequently in the $K$ sets
$\mathcal{A}_0 = \{A_1, A_2, \ldots, A_K\}$. Remove any set $A_k$ that contains $v_1$ from $\mathcal{A}_0$. We denote the
remainder as $\mathcal{A}_1$, which consists only of the sets $A_k$ that do not contain $v_1$.

4. Extract the second best spreader $v_2$ that appears most frequently in the remaining sets
$\mathcal{A}_1$. Remove any set $A_k$ that contains $v_2$ from $\mathcal{A}_1$, and denote the remainder as $\mathcal{A}_2$.

5. Repeat step (4) until we find the $M$ best spreaders $V = \{v_1, v_2, \cdots, v_M\}$.

Since none of the steps in this algorithm iterates through all nodes/edges, the computational
complexity is of order $O(1)$ with respect to the network size.
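The outline above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the simultaneous spreading of step 1 is modelled by letting all seeds share one lazily sampled bond-percolation realization per run, ties in steps 3 and 4 are broken by the smaller node label (so labels must be comparable), and the selection stops early if no sets remain. All function names are hypothetical.

```python
import random
from collections import Counter, deque


def giant_seeds_one_run(adj, seeds, beta, m, rng):
    """Return the seeds whose component reaches size m in one shared realization."""
    bond_open = {}  # link states sampled on demand and shared by all seeds

    def is_open(u, v):
        key = frozenset((u, v))
        if key not in bond_open:
            bond_open[key] = rng.random() < beta
        return bond_open[key]

    hits = set()
    for seed in seeds:
        visited = {seed}
        queue = deque([seed])
        while queue and len(visited) < m:
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited and is_open(u, v):
                    visited.add(v)
                    queue.append(v)
                    if len(visited) >= m:
                        break
        if len(visited) >= m:
            hits.add(seed)
    return hits


def select_spreaders(giant_sets, n_select):
    """Steps 3-5: pick the most frequent seed, drop the sets containing it, repeat."""
    remaining = [set(a) for a in giant_sets]
    chosen = []
    while len(chosen) < n_select:
        counts = Counter(v for a in remaining for v in a)
        if not counts:
            break  # every remaining run is already covered by the chosen seeds
        best = min(counts, key=lambda v: (-counts[v], v))
        chosen.append(best)
        remaining = [a for a in remaining if best not in a]
    return chosen


def pbga(adj, candidates, n_select, beta, m, n_runs=1000, rng=None):
    """Percolation-based greedy selection of n_select spreaders."""
    rng = rng or random.Random(0)
    sets = [giant_seeds_one_run(adj, candidates, beta, m, rng) for _ in range(n_runs)]
    return select_spreaders(sets, n_select)
```

Every outbreak is cut off at $m$ nodes, so the work per run depends on $L$, $m$ and the local degrees but not on the total number of nodes.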

4. Genetic algorithm (GA)

The basic procedure of the GA is: (i) first we generate a number of initial individuals; (ii)
then we apply genetic operators (including crossover and mutation) to the initial individuals to
generate the next generation; (iii) the new generation of individuals is selected according
to their ranking of spreading power; (iv) we stop the procedure when a fixed number of
generations is reached and choose the best individual from the last generation as the final
solution. Below we describe the protocol in detail.
FIG. 18: Illustration of the PBGA algorithm. Simulations start from big nodes (seeds). For green
seeds, the simulations die out before the number of activated nodes reaches $m = 50$. For red seeds,
the simulations are terminated once the number of activated nodes reaches $m$.

• Initialization: From the candidate set $W$, which contains $L$ nodes, we randomly
choose $M$ nodes as spreaders. We call such a choice an individual, i.e., each individual
is a set $V$ that consists of $M$ selected spreaders. A total number of $n_0$ individuals are
generated at the initialization step.

• Crossover: We randomly choose two individuals to apply the crossover operation.
To cross over two individuals $V_1$ and $V_2$, we first find their intersection $V_{\rm cro} = V_1 \cap V_2$.
If the size of $V_{\rm cro}$ is smaller than $M$, $|V_{\rm cro}| < M$, then we add $M - |V_{\rm cro}|$ nodes randomly
chosen from the set $(V_1 \cup V_2) \setminus (V_1 \cap V_2)$, where $(V_1 \cup V_2) \setminus (V_1 \cap V_2)$ is the set of nodes
that belong to the union $(V_1 \cup V_2)$ but not to the intersection $(V_1 \cap V_2)$. After this step,
the size of $V_{\rm cro}$ is guaranteed to be $M$. We generate a total number of $n_{\rm cro}$ crossovered
individuals.

• Mutation: Each node in an individual $V$ has a probability $p$ to be deleted. Let the
deleted set be $V_+$ and the remaining set be $V_-$ (note that $V = V_+ \cup V_-$). We then
randomly choose $|V_+|$ nodes from the set $W \setminus V_-$ and add these nodes to $V_-$ to generate
a new individual $V_{\rm mut}$, whose size is also $M$. The mutation is applied only to initial
individuals but not to crossovered individuals. Thus we generate a total number of
$n_{\rm mut} = n_0$ mutated individuals.

• Selection: From initialization, crossover, and mutation, we obtain in total $2n_0 + n_{\rm cro}$
individuals. These individuals are ranked according to their set spreading power $S(V)$.
The best $n_0$ individuals are selected for the next generation.

• Termination: The above steps form one iteration. We terminate the process after
$n_{\rm ter}$ iterations, and the final solution is the best individual in the last generation.

In our simulations, we use the following parameters: $n_0 = 1000$, $n_{\rm cro} = 500$, $p = 0.1$, $n_{\rm ter} = 100$.
The computational complexity is proportional to $N$ in getting the spreading power,
whereas the other parameters do not depend on $N$. Thus the overall complexity is $O(N)$.
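The crossover and mutation operators described above can be written compactly with Python sets. This is a sketch with assumed names (`crossover`, `mutate`); individuals and the candidate set are plain `set` objects with sortable labels, and sorting is used only so that sampling is reproducible for a given random generator.

```python
import random


def crossover(v1, v2, rng):
    """Keep the common nodes and fill up to M = |v1| from the symmetric difference."""
    common = v1 & v2
    pool = sorted((v1 | v2) - common)
    extra = rng.sample(pool, len(v1) - len(common))
    return common | set(extra)


def mutate(individual, candidates, p_delete, rng):
    """Delete each node with probability p_delete and refill from W minus the kept nodes."""
    deleted = {v for v in sorted(individual) if rng.random() < p_delete}  # V_+
    kept = individual - deleted                                           # V_-
    pool = sorted(candidates - kept)                                      # W \ V_-
    return kept | set(rng.sample(pool, len(deleted)))
```

Both operators preserve the size $M$ of an individual, which is the invariant the selection step relies on. Note that, as in the description above, a deleted node may be drawn again during mutation because it still belongs to $W \setminus V_-$.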

B. Algorithms based on heuristic measures 

1. Maximum degree (MD) 

This method selects the first $M$ nodes with the highest degrees (hubs), thus its
computational complexity is as low as $O(1)$. This method is fast but usually unreliable in real
networks when $\beta > \beta_c$. This is because when $\beta > \beta_c$, the spreading can reach much further
than the nearest neighbors. The MD method simply ignores network information beyond
the nearest neighbors, as well as the degree-degree correlations among the hubs.

2. Maximum k-shell (MK) 

K-shell (or k-core) index is a measure of the centrality of a node inside a network j^. In 
the k-shell decomposition, nodes with degree one (leaves) are removed successively until no
more leaves are left. These removed nodes are assigned the k-shell index $k_s = 1$. Next, all
remaining nodes with degree two or less are removed in the same successive way and assigned
$k_s = 2$. The procedure is repeated with increasing $k_s$ until all nodes are labeled. As the name
implies, a node with a high k-shell index is deeply embedded in the network's core. The
computational complexity of the decomposition is $O(N)$, proportional to the network size.
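The pruning procedure can be sketched as follows. This is a minimal illustration, assuming an undirected graph without self-loops stored as a dictionary that maps each node to the set of its neighbours; the function name and data layout are choices made here. For simplicity it rescans the remaining nodes at each level instead of using the bucket structure needed for strictly linear running time.

```python
def k_shell_indices(adj):
    """Return {node: k_s} for a graph given as {node: set_of_neighbours}."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose current degree is at most k are pruned at level k.
        peel = [u for u in remaining if degree[u] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            u = peel.pop()
            if u not in remaining:
                continue  # already pruned through another neighbour
            remaining.discard(u)
            shell[u] = k
            for v in adj[u]:
                if v in remaining:
                    degree[v] -= 1
                    if degree[v] <= k:
                        peel.append(v)
    return shell
```

In a triangle with a two-node tail attached to one corner, for example, the tail nodes are pruned at $k_s = 1$ and the triangle at $k_s = 2$.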

3. Maximum betweenness (MB) 

The betweenness centrality is another centrality measure of individual nodes. For the
node of interest, betweenness is the sum, over all pairs of other nodes in the network, of the
fraction of shortest paths between the pair that pass through this node. Thus the higher is the
betweenness, the more central is the node. The computational complexity is O(Ai^) for the 
efficient betweenness algorithm developed in Ref. [41].
4. Maximum closeness (MC)


The closeness is associated with the distance of a node to all other nodes, where distance
is the length of the shortest path between two nodes in the network. It is defined as
the reciprocal of the total distance between this node and all other nodes in a connected
network. Therefore the node with the smallest sum of distances has the largest closeness.
In terms of finding influential spreaders, the nodes with the higher closeness indices are
considered to be more influential spreaders. Its computational complexity
421 is O(A^MogiV). 
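As an illustration of this definition, the sketch below computes closeness for an unweighted, connected graph using one breadth-first search per node. The adjacency-dictionary layout and the function names are assumptions made for this example. A weighted network would need a shortest-path routine such as Dijkstra's algorithm instead of breadth-first search.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj):
    """Reciprocal of the summed distances from each node to all others."""
    result = {}
    for u in adj:
        total = sum(bfs_distances(adj, u).values())
        result[u] = 1.0 / total if total else 0.0
    return result
```

On a three-node path, the middle node has total distance 2 and the end nodes have total distance 3, so the middle node has the largest closeness.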


5. Maximum Katz (MK) 


Unlike betweenness and closeness, the Katz centrality takes into account all paths,
rather than only the shortest paths, between a pair of nodes [43]. The Katz centrality is
based on the adjacency matrix. Longer paths carry exponentially less weight, as each
additional unit length in the path is penalized by a constant decay factor. The nodes with
the higher Katz centrality correspond to better spreaders. In general, its computational 
complexity is 0{N^), although Ref j^] proposed that it is possible to lower the complexity 
to $O(N)$.
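A minimal sketch of the path-counting idea is given below. It assumes the common form of the Katz centrality, $x_i = \sum_{k \ge 1} \alpha^k \sum_j (B^k)_{ij}$ with $B$ the adjacency matrix, which satisfies $x = \alpha B (\mathbf{1} + x)$ and can be obtained by fixed-point iteration when the decay factor $\alpha$ is smaller than the inverse of the largest eigenvalue of $B$. The symbol $\alpha$, its default value, and the function name are assumptions of this example and are not taken from the text.

```python
def katz_centrality(adj, alpha=0.1, max_iter=1000, tol=1e-12):
    """Iterate x <- alpha * B * (1 + x) for a graph {node: set_of_neighbours}."""
    x = {u: 0.0 for u in adj}
    for _ in range(max_iter):
        # Each neighbour contributes the walk ending there plus one more step.
        new = {u: alpha * sum(1.0 + x[v] for v in adj[u]) for u in adj}
        if max(abs(new[u] - x[u]) for u in adj) < tol:
            return new
        x = new
    raise RuntimeError("no convergence; alpha may be too large")
```

Each iteration touches every edge once, so the cost per iteration is proportional to the number of edges in this sketch.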


6. Eigenvector method (EM) 

This method uses the eigenvector corresponding to the largest eigenvalue of the adjacency
matrix to estimate the influence of a node. The nodes are ranked according to their
corresponding values in the eigenvector, with the largest being the most influential. Let the
largest eigenvalue of the adjacency matrix $B$ be $\lambda$, and its eigenvector be $v$; then they satisfy
$Bv = \lambda v$. Nodes with larger entries in $v$ have higher eigenvector centrality.
This measure is related to the Katz centrality. Its computational complexity is $O(N)$.
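The leading eigenvector can be approximated by power iteration, as in the sketch below. It assumes a connected, undirected graph stored as an adjacency dictionary. It iterates with $B + I$ rather than $B$, a choice made here because the shift leaves the eigenvectors unchanged while avoiding the oscillation that plain power iteration shows on bipartite graphs. The function name and tolerances are likewise assumptions.

```python
import math


def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Power iteration on (B + I) for a graph {node: set_of_neighbours}."""
    x = {u: 1.0 for u in adj}
    for _ in range(max_iter):
        new = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = math.sqrt(sum(val * val for val in new.values()))
        new = {u: val / norm for u, val in new.items()}
        if max(abs(new[u] - x[u]) for u in adj) < tol:
            return new
        x = new
    return x
```

For a star with three leaves, the largest eigenvalue is $\sqrt{3}$ and the centre's entry is $\sqrt{3}$ times that of a leaf, so the centre ranks first.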


7. Maximum collective influence (MCI) 

The collective influence (CI) of node $i$ is defined as the product of its reduced degree
$(k_i - 1)$ and the total reduced degree of all nodes at distance $\ell$ from it, i.e.,
$\mathrm{CI}_\ell(i) = (k_i - 1) \sum_{j \in \partial \mathrm{Ball}(i,\ell)} (k_j - 1)$, where $\partial \mathrm{Ball}(i,\ell)$ is the frontier of a ball
of radius $\ell$ (with distance defined by the shortest path) around node $i$ [1^. The CI is exact
when $\ell \to \infty$, but can be well approximated using a finite value of $\ell$. This approximated
version of CI is a local measure with computational complexity $O(1)$. However, even the
exact CI should be considered as an approximation of $S(i)$, because: (i) by definition, CI
only considers the central node $i$ and all nodes on the frontier of a ball, while $S(i)$ takes
into account all nodes inside a ball; (ii) the CI is originally derived for the Linear Threshold
model, which is different from the SIR model considered here; (iii) the CI is originally
derived from the cavity method based on locally tree-like network structures, while real
OSNs usually have short loops (note that $S(i)$ does not require the network to be locally
tree-like). Interestingly, we expect that our influence radius $\ell^*$ gives a reasonably optimized
$\ell$ in CI, which may provide a solution for the question raised in Ref. [27]. In this study, we
use $\ell = 3 \approx \ell^*$.
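The definition of the collective influence translates directly into a truncated breadth-first search, sketched below for a graph stored as an adjacency dictionary. The function name and data layout are assumptions made for illustration. The search stops expanding once it reaches distance $\ell$, so only the ball around node $i$ is visited.

```python
from collections import deque


def collective_influence(adj, i, ell):
    """(k_i - 1) times the summed reduced degree at distance exactly ell."""
    dist = {i: 0}
    queue = deque([i])
    frontier_sum = 0
    while queue:
        u = queue.popleft()
        if dist[u] == ell:
            # u lies on the frontier of the ball; do not expand further.
            frontier_sum += len(adj[u]) - 1
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(adj[i]) - 1) * frontier_sum
```

The work per node depends only on the size of the ball of radius $\ell$, which is why the approximated CI behaves as a local measure.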


C. Comparisons of time complexities and performances 


The theoretical computational complexities of the algorithms discussed above are
summarized in Table I. Their performances in the Macau Weibo network are compared in Fig. 19.



FIG. 19: Comparison of the performance $S(\mathcal{V})$ (normalized by a common factor $s^\infty$) of
different algorithms in the Macau Weibo network ($\beta = 0.055$), for different numbers of seeds $M$, with
$L = 100$ candidates. (Inset) The regime $1 < M < 6$ is enlarged, where the rigorous optimum
obtained from BFS is available.
TABLE I: Theoretical time complexities per node of different optimization algorithms.

Algorithm    PBGA    MD      MCI*
Complexity   O(1)    O(1)    O(1)

Algorithm    NGA     GA      MRS     EM      BFS
Complexity   O(N)    O(N)    O(N)    O(N)    O(N)

Algorithm MG 

MB 



Complexity O(AGogA) 0{N^) 



Algorithm MK 

Complexity 0{N^) 


* With the approximated version of CI.


X. PERFORMANCE BOUNDARIES OF PBGA 

Because the PBGA is a greedy algorithm, it does not necessarily give the optimal solution.
In this section, we discuss how to analyze its performance boundary. Note that the methods
discussed below can be easily generalized to evaluate the performance boundaries of other
algorithms.

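Let $\mathcal{V} = \{v_1, v_2, \ldots, v_M\}$ be the best $M$ spreaders identified by the PBGA, and
$\mathcal{V}^* = \{v_1^*, v_2^*, \ldots, v_M^*\}$ be the rigorous optimal solution; then the performance ratio is

$$
\mathrm{PR} = \frac{S(\mathcal{V})}{S(\mathcal{V}^*)} = \frac{p(\mathcal{V}, s^\infty)}{p(\mathcal{V}^*, s^\infty)}. \tag{90}
$$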

Below we discuss how to find lower boundaries for PR($M$) by establishing upper bounds
of $p(\mathcal{V}^*, s^\infty)$. Note that our derivations are independent of the approximated scheme we
used to calculate $p(\mathcal{V}, s^\infty)$ and $p(\mathcal{V}^*, s^\infty)$. The rigorous boundaries I and II are “rigorous” in
the sense that, if the exact values of $p(\mathcal{V}, s^\infty)$ and $p(\mathcal{V}^*, s^\infty)$ are used, then these boundaries
are also exactly valid. Numerically, however, we use approximated values in place of the exact
$p(\mathcal{V}, s^\infty)$ and $p(\mathcal{V}^*, s^\infty)$, in order to reduce the time complexity of the
computation.
A. Rigorous boundary I


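Because $p(\mathcal{V}^*, s^\infty)$ is the probability that at least one of the nodes in $\mathcal{V}^*$ belongs to the
giant component, it cannot exceed the sum of the individual probabilities,

$$
p(\mathcal{V}^*, s^\infty) \le \sum_{i \in \mathcal{V}^*} p(i, s^\infty) \le \sum_{i \in \mathcal{U}^*} p(i, s^\infty), \tag{91}
$$

where $\mathcal{U}^* = \{u_1^*, u_2^*, \ldots, u_M^*\}$ is the set of the $M$ nodes with the largest individual probabilities.
Note that generally $\mathcal{U}^*$ could be different from the optimal solution $\mathcal{V}^*$. While $\mathcal{V}^*$ is difficult
to find, it is easy to find the set $\mathcal{U}^*$ by simply calculating and sorting individual probabilities.
We then have a rigorous lower boundary,

$$
\mathrm{PR}^{\min}_1 = \frac{p(\mathcal{V}, s^\infty)}{\sum_{i \in \mathcal{U}^*} p(i, s^\infty)}, \tag{92}
$$

such that $\mathrm{PR} \ge \mathrm{PR}^{\min}_1$. This boundary works well when $M$ is small. In particular, when
$M = 1$, the set $\mathcal{U}^*$ is identical to the optimal solution $\mathcal{V}^*$, and therefore $\mathrm{PR} = \mathrm{PR}^{\min}_1$.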


B. Rigorous boundary II 

Because our $M$ spreaders are selected from a given set $\mathcal{W}$ of $L$ candidates, the
probability $p(\mathcal{V}, s^\infty)$ for any $\mathcal{V}$ (including the optimal $\mathcal{V}^*$) cannot exceed $p(\mathcal{W}, s^\infty)$,
which is the probability that at least one node in the candidate set $\mathcal{W}$ belongs to the giant
component. We define the second rigorous lower boundary $\mathrm{PR}^{\min}_2$ as

$$
\mathrm{PR}^{\min}_2 = \frac{p(\mathcal{V}, s^\infty)}{p(\mathcal{W}, s^\infty)}, \tag{93}
$$

where the joint probability $p(\mathcal{W}, s^\infty)$ can be computed unambiguously without knowing $\mathcal{V}^*$.
This boundary works well for large $M$. When $M \to L$, it is clear that $\mathcal{V}^* \to \mathcal{W}$ (every
candidate has to be selected), and therefore $\mathrm{PR} \to \mathrm{PR}^{\min}_2$. The combined rigorous boundary is
the larger of the above two rigorous boundaries, $\max\{\mathrm{PR}^{\min}_1, \mathrm{PR}^{\min}_2\}$ (see Fig. 20).
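Once the probabilities are available, the two rigorous boundaries of Eqs. (92) and (93) reduce to simple arithmetic, as the sketch below illustrates. The function and argument names are hypothetical, and the numbers in the usage note are invented purely to show the calculation; they are not results from the paper.

```python
def rigorous_lower_bounds(p_selected, individual_probs, p_candidates, m):
    """Return (bound I, bound II, combined bound) for the performance ratio.

    p_selected       -- p(V, s_inf) of the M spreaders found by the algorithm
    individual_probs -- p(i, s_inf) for every candidate node i
    p_candidates     -- p(W, s_inf) of the whole candidate set
    m                -- number of selected spreaders M
    """
    # U* holds the m largest individual probabilities.
    top = sorted(individual_probs, reverse=True)[:m]
    bound_1 = p_selected / sum(top)
    bound_2 = p_selected / p_candidates
    return bound_1, bound_2, max(bound_1, bound_2)
```

For instance, with `p_selected = 0.5`, individual probabilities `[0.4, 0.3, 0.2, 0.1]`, `p_candidates = 0.8`, and `m = 2`, boundary I is 0.5/0.7 and boundary II is 0.5/0.8, so boundary I is the tighter one for this small value of $M$.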


C. Approximated boundary 

The third boundary we discuss here only applies rigorously to random networks, and is
considered as an approximated boundary for real OSNs. We first approximate $p(\mathcal{V}^*, s^\infty)$ as

$$
p(\mathcal{V}^*, s^\infty) \approx 1 - \prod_{i \in \mathcal{V}^*} \left[1 - p(i, s^\infty)\right]. \tag{94}
$$
FIG. 20: The performance boundaries of PBGA in the Macau Weibo network ($\beta = 0.055$). The
combined rigorous lower bound $\max\{\mathrm{PR}^{\min}_1, \mathrm{PR}^{\min}_2\}$, the approximated lower bound $\mathrm{PR}^{\min}_{\rm approx}$, and
the submodular lower bound $\mathrm{PR}^{\min}_{\rm submod} \approx 0.63$ are plotted as functions of $M$ ($L = 100$).


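The equality holds for large random networks, wherein nodes are completely uncorrelated.
Given $\mathcal{U}^*$ as the set of nodes with the largest $M$ individual probabilities, from Eq. (94) we
further obtain

$$
p(\mathcal{V}^*, s^\infty) \le 1 - \prod_{i \in \mathcal{U}^*} \left[1 - p(i, s^\infty)\right]. \tag{95}
$$

Combining this result with the definition of the performance ratio, Eq. (90), gives our
approximated lower bound (see Fig. 20),

$$
\mathrm{PR}^{\min}_{\rm approx} = \frac{p(\mathcal{V}, s^\infty)}{1 - \prod_{i \in \mathcal{U}^*} \left[1 - p(i, s^\infty)\right]}. \tag{96}
$$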

XI. NON-VIRAL (LOCALIZED) SPREADING 


In the main text, we focus on the viral spreading case with $\beta > \beta_c$. For random networks
with power-law degree distributions, $\beta_c = \langle k \rangle / \langle k^2 \rangle \sim 0$, thus the condition $\beta > \beta_c$
always holds ($\beta$ is non-negative by definition). Although real OSNs usually also have power-law
degree distributions, their critical thresholds $\beta_c$ are generally nonzero, due to complex
network structures and finite-size effects. Fortunately, the spreading problem in the non-viral
spreading phase is much simpler, because the spreading power usually can be well estimated
using only the first and the second nearest neighbors (as shown in Fig. 21). In this case,
the problem of selecting the best spreaders can be mapped onto a covering problem. In
particular, the method developed in Ref. H can be efficiently applied. 
FIG. 21: In the case of non-viral spreading, the spreading power is well estimated by using only first
and second nearest neighbors. Data are obtained for NOLA Facebook with $\beta = 0.005$ ($\beta_c = 0.01$),
and for Macau Weibo with $\beta = 0.01$ ($\beta_c = 0.05$). Here $S(i)$ is the exact spreading power, which is
compared with the spreading power estimated using only first and second nearest neighbours.